Medical data mining in sentiment analysis based on optimized swarm search feature selection

  • Daohui Zeng
  • Jidong Peng
  • Simon Fong
  • Yining Qiu
  • Raymond Wong
Special Issue Article


In this paper, we propose a novel technique termed optimized swarm search-based feature selection (OS-FS), a swarm-type search function that selects an ideal subset of features for enhanced classification accuracy. Sentiment prediction is becoming an increasingly crucial machine learning technique for gaining insights from unstructured medical texts. Owing to its robustness and accuracy, it has recently gained popularity in the medical industry. Medical text mining is well known as a fundamental data analytics technique for sentiment prediction. A popular preprocessing step in text mining transforms medical text strings into word vectors, which together form a high-dimensional sparse matrix. However, such a sparse matrix poses problems for the induction of an accurate sentiment prediction model. Feature selection, which finds a subset of the original features of the sparse matrix, is a commonly used dimensionality reduction technique that can improve the accuracy of the prediction model. The swarm search in our proposed OS-FS can be optimized by a new feature evaluation technique called clustering-by-coefficient-of-variation. We apply the method to a case scenario comprising 279 medical articles on ‘meaningful use functionalities on health care quality, safety, and efficiency’, drawn from a systematic review of previous medical IT literature. For this medical text mining task, four sentiment classes (positive, mixed-positive, neutral and negative) are recognized from the document contents. Our experimental results indicate that OS-FS outperforms traditional feature selection methods from the literature.
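To make the preprocessing step concrete, the following is a minimal sketch, not the authors' implementation, of how text strings can be turned into word-count vectors and how a coefficient of variation (standard deviation divided by mean) can be computed per term. The function names, the whitespace tokenizer, and the toy documents are all assumptions made for illustration; the abstract does not specify how OS-FS clusters or uses these coefficients.

```python
import math
from collections import Counter


def build_word_vectors(docs):
    """Return (vocab, matrix); matrix[i][j] counts vocab[j] in docs[i].

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts.get(term, 0) for term in vocab])
    return vocab, matrix


def sparsity(matrix):
    """Fraction of zero cells in the document-term matrix."""
    cells = [value for row in matrix for value in row]
    if not cells:
        return 0.0
    return sum(1 for value in cells if value == 0) / len(cells)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if mean is 0)."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


def cv_per_feature(vocab, matrix):
    """Map each term to the coefficient of variation of its column."""
    columns = zip(*matrix)
    return {term: coefficient_of_variation(list(col))
            for term, col in zip(vocab, columns)}


if __name__ == "__main__":
    # Hypothetical toy documents, not taken from the 279-article corpus.
    toy_docs = [
        "ehr alerts improved safety",
        "ehr alerts showed mixed results",
        "ehr use had no effect",
    ]
    vocab, matrix = build_word_vectors(toy_docs)
    print("vocabulary size:", len(vocab))
    print("sparsity:", round(sparsity(matrix), 3))
    for term, cv in sorted(cv_per_feature(vocab, matrix).items()):
        print(term, round(cv, 3))
```

A term that occurs equally often in every document has a coefficient of variation of zero, while a term concentrated in a few documents scores higher; a feature evaluation step could use such scores to group or rank candidate features before a search procedure picks a subset.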


Keywords: Medical text mining; Optimized swarm search-based feature selection; Sentiment prediction; Clustering-by-coefficient-of-variation



This paper is supported by the research grant “Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF),” Grant No. MYRG2015-00128-FST, which is offered by the University of Macau, FST, and RDAO.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.



Copyright information

© Australasian College of Physical Scientists and Engineers in Medicine 2018

Authors and Affiliations

  • Daohui Zeng (1)
  • Jidong Peng (2)
  • Simon Fong (3)
  • Yining Qiu (4)
  • Raymond Wong (4)

  1. First Affiliated Hospital of Guangzhou University of TCM, Guangzhou, People’s Republic of China
  2. Ganzhou People’s Hospital, Jiangxi, People’s Republic of China
  3. Department of Computer and Information Science, University of Macau, Taipa, People’s Republic of China
  4. School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
