Feature selection for entity extraction from multiple biomedical corpora: A PSO-based approach

Methodologies and Application


Entity extraction is an important step in biomedical text mining. Among its many challenges, two are especially crucial: determining the most relevant feature set, so that the model is both precise and less complex, and adapting the system across multiple benchmark corpora. In this paper, we propose a novel method for feature selection that uses the search capability of particle swarm optimization. The compact feature set used for training the classifier yields much better results than the baseline model, which was developed with the complete set of features. We also develop a large number of features suitable for the named entity recognition task in the biomedical domain. The complete feature set is derived by studying the properties of the datasets and from domain knowledge. As the underlying learning algorithm we use a conditional random field, a robust classifier that has shown success in solving similar kinds of problems. Our experiments on multiple benchmark corpora yield performance on par with state-of-the-art techniques.


Particle swarm optimization (PSO), Feature selection, Conditional random field, Entity extraction
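
To make the idea of PSO-driven feature selection concrete, the following is a minimal sketch of a generic binary particle swarm optimizer searching over 0/1 feature masks. It is not the authors' implementation: the function names (binary_pso_select, toy_fitness), the parameter values, and the toy fitness function are all assumptions chosen for illustration. In a wrapper setting one would presumably replace toy_fitness with a held-out score of the classifier trained on the selected features.

```python
import math
import random

def binary_pso_select(fitness, n_features, n_particles=10, n_iters=30,
                      w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness`.

    Generic binary PSO sketch; all default parameter values are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    # Each particle's position is a 0/1 mask over the features.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to personal and global best.
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp the velocity
                vel[i][d] = v
                # Sigmoid of the velocity is the probability that the bit is 1.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score

# Hypothetical stand-in for a classifier score: features 0, 3 and 5 are
# "useful", and every selected feature costs a small complexity penalty.
USEFUL = {0, 3, 5}

def toy_fitness(mask):
    hits = sum(1 for i in USEFUL if mask[i])
    return hits - 0.1 * sum(mask)

mask, score = binary_pso_select(toy_fitness, n_features=8)
print(mask, score)
```

The penalty term in toy_fitness reflects the trade-off named in the abstract: a mask is rewarded for keeping informative features and penalized for its size, so the search favors compact feature sets.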



Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  1. Department of Computer Science Engineering, Indian Institute of Technology Patna, Patna, India
