A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

Habibi, Narjeskhatoon; Mohd Hashim, Siti Z; Norouzi, Alireza; Samian, Mohammed Razip

doi:10.1186/1471-2105-15-134

Research article
Open access
Published: 08 May 2014

Volume 15, article number 134, (2014)

BMC Bioinformatics

Narjeskhatoon Habibi¹,
Siti Z Mohd Hashim¹,
Alireza Norouzi¹ &
…
Mohammed Razip Samian²,³,⁴

Abstract

Background

Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low levels of production and frequent failure for obscure reasons. The problem is accentuated because there is no generic solution available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. Under given experimental conditions (temperature, expression host, etc.), protein solubility is ultimately determined by the protein sequence. To date, numerous methods based on machine learning have been proposed to predict the solubility of a protein merely from its amino acid sequence. Despite 20 years of research on the matter, no comprehensive review of the published methods is available.

Results

This paper presents an extensive review of the existing models to predict protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared with regard to the datasets used, features, feature selection methods, machine learning techniques and accuracy of prediction. A discussion of the models is provided at the end.

Conclusions

This study aims to investigate extensively the machine learning based methods to predict recombinant protein solubility, so as to offer both a general and a detailed understanding for researchers in the field. Some of the models present acceptable prediction performance and convenient user interfaces. These models can be considered valuable tools to predict recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.

Introduction

In biotechnology, production of recombinant proteins is a crucial process in both the biopharmaceutical industry and scientific research. So far, Escherichia coli (E. coli), a bacterium that requires simple conditions to grow, is still the favoured host for cloning and overexpressing most proteins that are non-glycosylated and do not have many cysteine residues [1].

Even though logical strategies of genetic engineering, such as strong promoters and codon optimization, are well established, protein overexpression is often still an art. In particular, heterologous expression is often afflicted by low levels of production and by insoluble recombinant proteins that form inclusion bodies (protein aggregates). Yet, there is no generic solution available to enhance heterologous overexpression. The use of fusion proteins can sometimes be more successful, at the expense of decreased total yield as a result of producing the fusion partner. Features that differentiate between proteins in the negative (non-expressed) and positive (expressed) classes might indicate sequence characteristics that could be modified in optimization, analogous to codon optimization, in which gene sequences are modified to become compatible with the translational apparatus [2]. Because the host expresses the proteins, one cause of non-expression is harmful interaction with the metabolism of the host [3].

For a given protein, the extent of its solubility can indicate the quality of its function. In general, over 30% of recombinant proteins are not soluble [4]. About 33 to 35 percent of all expressed non-membrane proteins are insoluble, and about 25 to 57 percent of soluble proteins are prone to aggregate at higher concentrations [5]. For a given set of experimental conditions (e.g. temperature, expression host, etc.), the solubility of a protein is determined by its sequence [6].

The trial-and-error procedure of protein overexpression can be avoided by identifying promising proteins, which improves the experimental success rate [7]. There are two types of approach for predicting protein solubility: sequence-based and structure-based. In the structure-based technique, the free energy difference between the aggregation and solution phases is computed. This method demands experimentally obtained high-resolution 3D structures, which are hard to acquire for aggregation-prone proteins. Hence, the sequence-based technique is the more feasible and more widely used method. Generally, the computational sequence-based prediction methods investigate protein overexpression in E. coli at the normal growth temperature of 37°C [8].

The correlation between amino acid sequence and the tendency to form inclusion bodies was first shown by Wilkinson and Harrison [9]. Later, numerous methods based on machine learning were proposed to predict the solubility of proteins merely from their amino acid sequences [10].

Protein solubility prediction can be considered a binary classification task in which a classifier should discriminate between soluble proteins (positive samples) and insoluble proteins (negative samples). There are several classification methods (learning algorithms), including decision tree (DT) (e.g. C4.5 [11]), k-nearest-neighbour (KNN) [12], neural network (NN) [13, 14] and support vector machine (SVM) [15].
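To make the binary classification setting concrete, the following minimal Python sketch implements a k-nearest-neighbour classifier, one of the methods listed above. It is an illustration under stated assumptions, not the method of any reviewed study: the two features (fraction of charged residues and fraction of hydrophobic residues), the toy values and the function name knn_predict are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs, where label 1 means
    soluble (positive sample) and label 0 means insoluble (negative sample).
    """
    # Sort the training examples by Euclidean distance to the query point
    # and keep the k closest ones.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of the k nearest neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data (assumption, not from the reviewed datasets):
# (fraction of charged residues, fraction of hydrophobic residues), label
train = [
    ((0.30, 0.20), 1),
    ((0.28, 0.25), 1),
    ((0.32, 0.22), 1),
    ((0.10, 0.45), 0),
    ((0.12, 0.50), 0),
    ((0.15, 0.48), 0),
]

print(knn_predict(train, (0.29, 0.21)))  # query lies among the soluble examples
print(knn_predict(train, (0.11, 0.47)))  # query lies among the insoluble examples
```

In practice the feature vector would hold the sequence-derived features used by a given study, and features on different scales would normally be standardized before distances are computed.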

The learning algorithm (i.e. the classification method) is selected based on numerous factors, such as the number of examples in the dataset, the type of data to be classified (e.g. symbolic or numeric), and the number of examples likely to be inaccurate or noisy. The desired level of interpretability of the outcomes is another issue to be considered [16].

The majority of current methods use SVM to build the solubility model [4]. Appropriate SVM models can often achieve better performance in classifying biological sequences than other machine learning-based approaches [1]. Each study employs a different set of features. Reported performance varies between studies, but an accuracy of about 70% is common [4].

To date, nearly all of the prediction approaches have examined a single protein expression system, such as the A. niger or the E. coli system. The works of Hirose et al. [3, 10] are exceptions that explored two different systems (E. coli and wheat germ).

Some of the suggested methods of prediction offer their work as widely accessible web servers [3, 10, 17–20].

In spite of more than two decades of research on the subject, there has been only one report, which reviews seven solubility prediction tools [21]. In their valuable review, the authors compared seven existing prediction tools based on the following factors: prediction accuracy, usability, utility, and prediction tool development and validation methodologies. Our aim is to evaluate and investigate all published methods to predict protein solubility, so as to offer both a detailed and a general understanding for researchers.

The organization of the paper is as follows. The major protein solubility prediction studies are reviewed in section 2, with emphasis on their datasets, features, feature selection methods, predictor models and performance results. Section 3 presents a discussion of the model details, the best models and the data challenges of the solubility prediction task. Lastly, section 4 concludes the paper and proposes some future research directions.

Review

The machine learning based methods to predict protein solubility are summarized in Table 1 in chronological order, descending from the most recent. Due to space limitations, the reported performance of the works and the features used in each work are shown in Table 2 and Table 3, respectively. More detailed descriptions of the works are presented in “Additional file 1”.

Table 1 A summary of key components of studies to predict protein solubility (in chronological order)

Table 2 Reported prediction performances of the models (in chronological order)

Table 3 Features used to predict protein solubility

In the tables, the symbol “-” is used for an entry that has no corresponding column value. The symbol “N/A” (not applicable, not available or no answer) is used for an entry whose value may exist but could not be found.

To help in understanding the details of the works presented in Table 1, Table 2 and Table 3, the datasets used, feature selection methods and performance measures are described in greater detail in Table 4, Table 5 and Table 6, respectively.

Table 4 Databases/datasets used to predict protein solubility (in chronological order)

Table 5 Description of feature selection methods used in machine learning [37]

Table 6 Performance measures used to evaluate protein solubility prediction (in alphabetical order)
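As an illustration of how performance measures of this kind are computed for a binary solubility classifier, the following minimal Python sketch derives accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC) from a confusion matrix. The choice of these four measures, the function names and the toy labels are assumptions made for illustration; they are not taken from Table 6.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for labels where 1 = soluble, 0 = insoluble."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def performance(y_true, y_pred):
    """Compute common binary-classification measures from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    # Denominator of the Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        # Fraction of truly soluble proteins predicted as soluble.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of truly insoluble proteins predicted as insoluble.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # MCC is set to 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical toy labels: four soluble and four insoluble proteins.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
result = performance(y_true, y_pred)
print(result)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative when the numbers of soluble and insoluble samples differ.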

It should be mentioned that in some works several modeling techniques are examined and then one or more are selected as the final model(s). In the “Modeling Technique(s)” column of Table 1, only the final models are shown. The same is true for the “Feature Selection Method(s)” column. In addition, in most of the works, an initial feature set is considered first, and then a smaller subset is obtained using feature selection methods and employed in the modeling. Table 3 presents the final feature sets used in the models.

With respect to the data used in each study, some of the authors created a dataset harvested from the literature, some employed public datasets, while others performed experiments to generate their own dataset.

Discussion

This section investigates the works in more depth. In the following paragraph, the most used dataset, features, feature selection methods and learning techniques are presented. Afterwards, the best models based on the reported accuracies are introduced. Then, the models that are most convenient to use are mentioned. Lastly, some data-related challenges are discussed.

In terms of data, eSol is the most widely used dataset in the field. Considering input features, the following are the most common ones computed from the protein sequence: aliphatic index, amino acid sequence length, charge, amino acid composition, instability, isoelectric point (pI), hydrophilicity, molecular weight, and predicted secondary structure. Filter methods (described in Table 5) are used more than the other feature selection techniques. Regarding the machine learning method, support vector machine is the most common technique for making predictions; random forest, decision tree and logistic regression are the next most common, in that order.
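Several of the features listed above can be computed directly from the amino acid sequence. The following Python sketch computes sequence length, amino acid composition, the aliphatic index (using the standard Ikai formula with mole fractions of Ala, Val, Ile and Leu) and a deliberately crude net charge. The function names, the example sequence and the charge approximation (+1 for Lys/Arg, -1 for Asp/Glu, ignoring His, termini and pH) are assumptions for illustration and do not reproduce the feature extraction of any particular reviewed study.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition(seq):
    """Return the mole fraction of each standard amino acid in `seq`."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def aliphatic_index(seq):
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)), in mole percent."""
    comp = composition(seq)
    return 100.0 * (comp["A"] + 2.9 * comp["V"] + 3.9 * (comp["I"] + comp["L"]))


def approximate_net_charge(seq):
    """Crude net charge: +1 per Lys/Arg, -1 per Asp/Glu (assumption, pH-independent)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def sequence_features(seq):
    """Collect a small, illustrative feature dictionary for one protein sequence."""
    seq = seq.upper()
    features = {
        "length": len(seq),
        "aliphatic_index": aliphatic_index(seq),
        "net_charge": approximate_net_charge(seq),
    }
    # Add one composition feature per amino acid, e.g. "frac_A".
    for aa, frac in composition(seq).items():
        features["frac_" + aa] = frac
    return features


# Hypothetical ten-residue example sequence.
feats = sequence_features("MKVLAAGIDE")
print(feats["length"], feats["aliphatic_index"], feats["net_charge"])
```

Features such as pI, instability index and predicted secondary structure require titration models, lookup tables or external predictors and are therefore left out of this sketch.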

Based on the results, the method reported by Diaz et al. [20] obtained the best prediction accuracy (94%) on their generated dataset. A similar prediction accuracy of 90% was reported by Samak et al. [4] on the eSol dataset, followed by the works of Xiaohui et al. [7] and Wilkinson and Harrison [9], with prediction accuracies of 88% on their generated datasets.

Comparing the different models in terms of convenience and ease of use, the ones with publicly accessible web servers can be considered the most convenient to use and evaluate. They are ProS [5], PROSOII [6], SCM [8], ESPRESSO [10], SOLpro [17], PROSO [19] and the model of Diaz et al. [20].

It seems that by using an appropriate dataset, as well as suitable machine learning techniques, reasonable prediction performance is attainable. In addition, feature selection methods can reveal, to some extent, influential factors on solubility and the sequence characteristics that could be modified in optimization.
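The idea of a filter method, which scores each feature independently of the final classifier and keeps the highest-ranked ones, can be sketched in a few lines of Python. The score used below (absolute difference of class means divided by the sum of the class standard deviations) is an assumption chosen for simplicity; the reviewed studies use their own statistics, and the feature names and values are hypothetical.

```python
import statistics


def filter_rank(X, y, names):
    """Rank features by a simple univariate class-separation score.

    X is a list of feature rows, y the matching labels (1 = soluble,
    0 = insoluble) and names the feature names. Returns the names sorted
    from most to least discriminative.
    """
    scores = {}
    for j, name in enumerate(names):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        diff = abs(statistics.mean(pos) - statistics.mean(neg))
        spread = statistics.pstdev(pos) + statistics.pstdev(neg)
        if spread:
            scores[name] = diff / spread
        else:
            # Zero spread: perfectly separating if the means differ, useless otherwise.
            scores[name] = float("inf") if diff else 0.0
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: three features for three soluble and three insoluble proteins.
names = ["charge", "length", "noise"]
X = [
    [-5, 200, 0.5],
    [-6, 220, 0.1],
    [-4, 210, 0.9],
    [2, 400, 0.4],
    [3, 390, 0.2],
    [1, 410, 0.8],
]
y = [1, 1, 1, 0, 0, 0]

ranking = filter_rank(X, y, names)
selected = ranking[:2]  # keep the two highest-scoring features
print(ranking, selected)
```

Because every feature is scored on its own, a filter of this kind is cheap and classifier-independent, but it cannot detect features that are informative only in combination.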

Poor generalization ability is one of the limitations of sequence-based methods built on a small dataset [35]. In general, extracting a dataset that is reliable in terms of experimental conditions and expression system is challenging, because the majority of databases that provide information on protein solubility do not give comprehensive information about the experimental particulars of the solubility assessment. Furthermore, researchers generally have to handle imbalanced data (i.e. unequal numbers of soluble and insoluble samples) when collecting protein solubility records. Consequently, research teams have used different methods to collect consistent datasets that divide proteins into insoluble and soluble categories [24, 27].
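One generic way to deal with the class imbalance mentioned above is random undersampling of the majority class. The Python sketch below illustrates the idea; it is not claimed to be the procedure used in the cited studies, and the function name, record identifiers and class sizes are hypothetical.

```python
import random


def undersample(samples, seed=0):
    """Balance a binary dataset by randomly discarding majority-class samples.

    `samples` is a list of (record, label) pairs with label 1 = soluble and
    0 = insoluble. A fixed seed keeps the result reproducible.
    """
    rng = random.Random(seed)
    soluble = [s for s in samples if s[1] == 1]
    insoluble = [s for s in samples if s[1] == 0]
    # Size of the smaller class determines how many samples each class keeps.
    n = min(len(soluble), len(insoluble))
    balanced = rng.sample(soluble, n) + rng.sample(insoluble, n)
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced dataset: eight soluble and three insoluble records.
samples = [("sol_%d" % i, 1) for i in range(8)] + [("ins_%d" % i, 0) for i in range(3)]
balanced = undersample(samples)
print(len(balanced), sum(label for _, label in balanced))
```

Undersampling discards data, which matters for the small datasets discussed here; class weighting or oversampling of the minority class are common alternatives.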

It is worth mentioning that the datasets employed to build SOLpro [17] and PROSOII [6] were gathered by integrating different search results from TargetDB, the Protein Data Bank (PDB) and the Swiss-Prot database. The proteins were then categorized into insoluble and soluble samples according to their annotations. Although these methods were the best available option when an appropriate experimental dataset did not exist, they might not be completely reliable. A soluble protein without appropriate annotation, for example, may be incorrectly categorized as an insoluble protein and vice versa. Furthermore, annotations from diverse databases may not be consistent. Clearly, it is desirable to have a large protein set whose solubility has been determined experimentally by a single reliable protocol [5].

Conclusions

In this paper, the works on protein solubility prediction are reviewed in detail. They are assessed and classified with regard to the datasets used, features used, feature selection methods, machine learning algorithms and performance results.

Since the early work of Wilkinson and Harrison [9], the proposed models have become more complex in terms of dataset size, number and types of features employed, feature evaluation techniques and machine learning methods used for prediction. In general, the performance of the models has improved greatly as well.

Some of the models provide acceptable prediction performance (e.g. in terms of accuracy). The ones with convenient user interfaces (e.g. web applications), in particular, can be considered valuable tools to anticipate recombinant protein overexpression results before performing real laboratory experiments. This capability will lead to a significant reduction in labour, time and cost.

Generating larger and more accurate datasets, working on organisms other than E. coli and discovering other influential features are some considerations for future directions in the protein solubility prediction field.

Authors’ information

NH received her M.Sc. in Artificial Intelligence from Isfahan University of Technology, Iran, in 2009 and her B.Sc. in Software Engineering from the same university in 2005. She has been a faculty member of the Islamic Azad University (IAU) in Iran since 2011. Presently she is pursuing a Ph.D. in Computer Science at Universiti Teknologi Malaysia. Her research interests are bioinformatics, synthetic biology, artificial intelligence and machine learning.

SZMH is an Associate Professor at the Department of Software Engineering, Faculty of Computing, Universiti Teknologi Malaysia (UTM). She received her B.Sc. Degree in Computer Science from University of Harford, USA, M.Sc. in Computing from University of Bradford, UK and Ph.D. research in Soft Computing from University of Sheffield, UK. Her research interests are Soft Computing techniques and applications, System Development and Intelligent System. Currently she is the Deputy Dean of Academic, Faculty of Computing, UTM and a member of Soft Computing Research Group (SCRG), K-Economy, UTM.

AN received his M.Sc. in Computer Engineering from Islamic Azad University, Iran, in 2006 and his B.Sc. in Computer Science from Yazd University, Iran, in 2003. He has been a faculty member of the Islamic Azad University (IAU) in Iran since 2007. Presently he is pursuing a Ph.D. in Computer Science at Universiti Teknologi Malaysia. His research interests focus on machine learning, pattern recognition and computer vision.

MRS received his Ph.D. from University of New South Wales, Australia, in Biotechnology. He is currently a faculty member (Professor) in the School of Biological Sciences, Universiti Sains Malaysia. The research in his laboratory focuses on molecular genetics and structural biology of proteins. He has published extensively in these areas.

References

1. Chan WC, Liang PH, Shih YP, Yang UC, Lin WC, Hsu CN: Learning to predict expression efficacy of vectors in recombinant protein production. BMC Bioinform. 2010, 11 (Suppl 1): S21. 10.1186/1471-2105-11-S1-S21.
2. van den Berg BA, Reinders MJ, Hulsman M, Wu L, Pel HJ, Roubos JA, de Ridder D: Exploring sequence characteristics related to high-level production of secreted proteins in Aspergillus niger. PLoS One. 2012, 7 (10): e45869. 10.1371/journal.pone.0045869.
3. Hirose S, Kawamura Y, Yokota K, Kuroita T, Natsume T, Komiya K, Tsutsumi T, Suwa Y, Isogai T, Goshima N, Noguchi T: Statistical analysis of features associated with protein expression/solubility in an in vivo Escherichia coli expression system and a wheat germ cell-free expression system. J Biochem. 2011, 150 (1): 73-81. 10.1093/jb/mvr042.
4. Samak T, Gunter D, Wan Z: Prediction of Protein Solubility in E. coli. 2012, Chicago, IL: E-Science (e-Science), 2012 IEEE 8th International Conference on Date of Conference: 8-12 Oct. 2012, 1-8.
5. Fang Y, Fang J: Discrimination of soluble and aggregation-prone proteins based on sequence information. Mol BioSyst. 2013, 9 (4): 806-811. 10.1039/c3mb70033j.
6. Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D: PROSO II-a new method for protein solubility prediction. FEBS J. 2012, 279 (12): 2192-2200. 10.1111/j.1742-4658.2012.08603.x.
7. Xiaohui N, Feng S, Xuehai H, Jingbo X, Nana L: Predicting the protein solubility by integrating chaos games representation and entropy in information theory. Expert Syst Appl. 2014, 41 (4): 1672-1679. 10.1016/j.eswa.2013.08.064.
8. Huang H, Charoenkwan P, Kao T, Lee H, Chang F, Huang W, Ho S, Shu L, Chen W, Ho S: Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics. 2012, 13 (Suppl 17): S3.
9. Wilkinson DL, Harrison RG: Predicting the solubility of recombinant proteins in Escherichia coli. Nat Biotechnol. 1991, 9 (5): 443-448. 10.1038/nbt0591-443.
10. Hirose S, Noguchi T: ESPRESSO: a system for estimating protein expression and solubility in protein expression systems. Proteomics. 2013, 13 (9): 1444-1456. 10.1002/pmic.201200175.
11. Quinlan JR: C4.5: Programs for Machine Learning. Vol: 1. 1993, USA: Morgan Kaufmann
12. Cover T, Hart P: Nearest neighbor pattern classification. Inform Theory IEEE Transac. 1967, 13 (1): 21-27.
13. Rosenblatt F: Principles of Neurodynamics. 1962, New York: Spartan
14. Rumelhart DE, Hinton GE, Williams RJ: Learning Internal Representations by Error Propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 1985, California University San Diego La Jolla Institute for Cognitive Science, Technical rept. Mar-Sep 1985. (No. ICS-8506)
15. Cortes C, Vapnik V: Support-vector networks. Mach Learn. 1995, 20 (3): 273-297.
16. Bertone P, Kluger Y, Lan N, Zheng D, Christendat D, Yee A, Edwards AM, Arrowsmith CH, Montelione GT, Gerstein M: SPINE: An integrated tracking database and data mining approach for identifying feasible targets in high throughput structural proteomics. Nucleic Acids Res. 2001, 29 (13): 2884-2898. 10.1093/nar/29.13.2884.
17. Magnan CN, Randall A, Baldi P: SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics. 2009, 25 (17): 2200-2207. 10.1093/bioinformatics/btp386.
18. Davis GD, Elisee C, Newham DM, Harrison RG: New fusion protein systems designed to give soluble expression in Escherichia coli. Biotechnol Bioeng. 1999, 65 (4): 382-388. 10.1002/(SICI)1097-0290(19991120)65:4<382::AID-BIT2>3.0.CO;2-I.
19. Smialowski P, Martin-Galiano AJ, Mikolajka A, Girschick T, Holak TA, Frishman D: Protein solubility: sequence based prediction and experimental verification. Bioinformatics. 2007, 23 (19): 2536-2542. 10.1093/bioinformatics/btl623.
20. Diaz AA, Tomba E, Lennarson R, Richard R, Bagajewicz MJ, Harrison RG: Prediction of protein solubility in Escherichia coli using logistic regression. Biotechnol Bioeng. 2010, 105 (2): 374-383. 10.1002/bit.22537.
21. Chang CCH, Song J, Tey BT, Ramanan RN: Bioinformatics Approaches for Improved Recombinant Protein Production in Escherichia coli: Protein Solubility Prediction. 2013, Oxford: Briefings in bioinformatics, bbt057, First published online August 7, 2013. doi:10.1093/bib/bbt057
22. Stiglic G, Kocbek S, Pernek I, Kokol P: Comprehensive decision tree models in bioinformatics. PLoS One. 2012, 7 (3): e33812. 10.1371/journal.pone.0033812.
23. Agostini F, Vendruscolo M, Tartaglia GG: Sequence-based prediction of protein solubility. J Mol Biol. 2012, 421 (2): 237-241.
24. Kocbek S, Stiglic G, Pernek I, Kokol P: Stability of different feature selection methods for selecting protein sequence descriptors in protein solubility classification problem. Transition. 2010, 7 (21): 50-55.
25. Niwa T, Ying BW, Saito K, Jin W, Takada S, Ueda T, Taguchi H: Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins. Proc Natl Acad Sci. 2009, 106 (11): 4201-4206. 10.1073/pnas.0811922106.
26. Kumar P, Jayaraman VK, Kulkarni BD: Granular Support Vector Machine Based Method for Prediction of Solubility of Proteins on Overexpression in Escherichia coli. Pattern Recognition and Machine Intelligence, Second International Conference, PReMI 2007, Kolkata, India. 2007, Berlin Heidelberg: Springer, 406-415. Proceedings
27. Idicula-Thomas S, Kulkarni AJ, Kulkarni BD, Jayaraman VK, Balaji PV: A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli. Bioinformatics. 2006, 22 (3): 278-284. 10.1093/bioinformatics/bti810.
28. Idicula-Thomas S, Balaji PV: Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Sci. 2005, 14 (3): 582-592. 10.1110/ps.041009005.
29. Luan C, Qiu S, Finley JB, Carson M, Gray RJ, Huang W, Johnson D, Tsao J, Reboul J, Vaglio P, Hill DE, Vidal M, DeLucas LJ, Luo M: High-throughput expression of C. elegans proteins. Genome Res. 2004, 14 (10b): 2102-2110. 10.1101/gr.2520504.
30. Goh C, Lan N, Douglas SM, Wu B, Echols N, Smith A, Milburn D, Montelione GT, Zhao H, Gerstein M: Mining the structural Genomics Pipeline: identification of protein properties that affect high throughput experimental analysis. J Mol Biol. 2004, 336 (1): 115-130. 10.1016/j.jmb.2003.11.053.
31. Christendat D, Yee A, Dharamsi A, Kluger Y, Savchenko A, Cort JR, Booth V, Mackereth CD, Saridakis V, Ekiel I, Kozlov G, Maxwell KL, Wu N, McIntosh LP, Gehring K, Kennedy MA, Davidson AR, Pai EF, Gerstein M, Edwards AM, Arrowsmith CH: Structural Proteomics of an archaeon. Nat Struct Mol Biol. 2000, 7 (10): 903-909. 10.1038/82823.
32. Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ: PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2006, 34 (2): W32-W37.
33. Maruyama Y, Wakamatsu A, Kawamura Y, Kimura K, Yamamoto J, Nishikawa T, Kisu Y, Sugano S, Goshima N, Isogai T, Nomura N: Human Gene and Protein Database (HGPD): a novel database presenting a large quantity of experiment-based results in human proteomics. Nucleic Acids Res. 2009, 37 (1): D762-D766.
34. Kouranov A, Xie L, de la Cruz J, Chen L, Westbrook J, Bourne PE, Berman HM: The RCSB PDB information portal for structural genomics. Nucleic Acids Res. 2006, 34 (1): D302-D305.
35. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Padilla D, Ravichandran V, Schneider B, Thanki N, Weissig H, Westbrook JD, Zardecki C: The Protein Data Bank. Acta Crystallographica Section D: Biological Crystallography. 2002, 58 (6): 899-907. 10.1107/S0907444902003451.
36. Chen L, Oughtred R, Berman HM, Westbrook J: TargetDB: a target registration database for structural genomics projects. Bioinformatics. 2004, 20 (16): 2860-2862. 10.1093/bioinformatics/bth300.
37. Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics. 2007, 23 (19): 2507-2517. 10.1093/bioinformatics/btm344.
38. Ben-Bassat M: Pattern Recognition and Reduction of Dimensionality. Handbook of Statistics. Vol: 2. Edited by: Krishnaiah P, Kanal L. 1982, Amsterdam: North-Holland Publishing Co, 773-910.
39. Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2005, USA: Morgan Kaufmann, 2
40. Weston J, Pérez-Cruz F, Bousquet O, Chapelle O, Elisseeff A, Schölkopf B: Feature selection and transduction for prediction of molecular bioactivity for drug design. Bioinformatics. 2003, 19: 764-771. 10.1093/bioinformatics/btg054.
41. Mann HB, Whitney DR: On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947, 18 (1): 50-60. 10.1214/aoms/1177730491.
42. Kittler J: Feature Set Search Algorithms. Pattern Recognition and Signal Processing. Edited by: Chen C. 1978
43. Siedlecki W, Sklansky J: On automatic feature selection. Int J Pattern Recognit Artif Intell. 1988, 2 (02): 197-220.
44. Kononenko I, Šimec E, Robnik-Šikonja M: Overcoming the Myopia of inductive learning algorithms with RELIEFF. Appl Intell. 1997, 7 (1): 39-55. 10.1023/A:1008280620621.
45. Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32.
46. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Mach Learn. 2002, 46 (1-3): 389-422.
47. Piatetsky-Shapiro G: Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases. Edited by: Piatetsky-Shapiro G, Frawley WJ. 1991, Cambridge: MA
48. de Ridder D, de Ridder J, Reinders MJ: Pattern recognition in bioinformatics. Brief Bioinform. 2013, 14 (5): 633-647. 10.1093/bib/bbt020.

Acknowledgment

This work was supported by the Ministry of Higher Education of Malaysia [Grant No. KPT.B.600-18/3 (115) to NH]; Universiti Sains Malaysia [FRGS grant to MRS]; and Universiti Teknologi Malaysia. The authors appreciate the anonymous reviewers’ instructive suggestions.

Author information

Authors and Affiliations

Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
Narjeskhatoon Habibi, Siti Z Mohd Hashim & Alireza Norouzi
School of Biological Sciences, Universiti Sains Malaysia, Penang, Malaysia
Mohammed Razip Samian
Advanced Medical and Dental Institute, Universiti Sains Malaysia, Penang, Malaysia
Mohammed Razip Samian
Centre for Chemical Biology, Universiti Sains Malaysia, Penang, Malaysia
Mohammed Razip Samian

Corresponding author

Correspondence to Narjeskhatoon Habibi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

NH carried out the literature review studies and drafted the manuscript. SZMH and MRS conceived the idea of the study, and helped to draft the manuscript. AN helped to draft the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2013_6385_MOESM1_ESM.docx

Additional file 1: Detailed descriptions of 24 studies to predict protein solubility during 1991–2014 (February). (DOCX 272 KB)

About this article

Habibi, N., Mohd Hashim, S.Z., Norouzi, A. et al. A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli. BMC Bioinformatics 15, 134 (2014). https://doi.org/10.1186/1471-2105-15-134

Received: 04 September 2013
Accepted: 25 March 2014
Published: 08 May 2014
DOI: https://doi.org/10.1186/1471-2105-15-134
