Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
© The Microbiological Society of Korea
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2022M3A9B6082687 and NRF-2023R1A2C1008156) and was also supported by the Chung-Ang University Young Scientist Scholarship in 2021.
Conflict of Interest
The authors declare that they have no conflict of interest.
Category | Model | Dataset (soluble + insoluble) | Machine learning algorithm | Features | Performances | Pros | Cons | Availability | References |
---|---|---|---|---|---|---|---|---|---|
Classification models | PROSO | E. coli; > 14,000 (50% and 50%)¹⁾ | Chained models of SVM and naïve Bayes | Amino acid compositions and alpha-helical structure composition | Cross-validation: Accuracy: 0.72; AUC: 0.78 | Computationally efficient, simple to use | Lacks structural data, limited in complex solubility cases | Model is not available to access | Smialowski et al. (2007) |
Classification models | SOLpro | E. coli; 17,408 (8,704+8,704) | Chained 20 SVM models with 1 SVM output model | Compositions (amino acids, dipeptides, and tripeptides), physicochemical properties (hydropathy, charge, molecular weight, aliphatic index, etc.), secondary structure composition, exposed residues, number of domains, etc. | Cross-validation: Accuracy: 0.74; AUC: 0.74 | Advanced SVM-based architecture, better accuracy | Constrained by sequence-only features, no structural insight | Web-based tool is available at https://scratch.proteomics.ics.uci.edu/ | Magnan et al. (2009) |
Classification models | PROSO II | E. coli; 82,299²⁾ for training; 1,765 (⅙ + ⅚)¹⁾ hold-out for test | Chained model: two input models of Parzen window model and logistic regression classifier, and an output model of logistic regression classifier | Compositions (amino acids, dipeptides, and tripeptides), physicochemical properties (isoelectric point, GRAVY index, etc.), secondary structures, exposed residues, and number of domains | Independent test: Accuracy: 0.75 | Improved with larger datasets and refined algorithms | No structural data, reduced accuracy for complex proteins | Model is not available to access | Smialowski et al. (2012) |
Classification models | PaRSnIP | E. coli; 69,420 (28,972+40,448) for training; 2,001 (1,000+1,001)³⁾ for testing | Gradient boosting machine | Compositions (amino acids, dipeptides, and tripeptides), sequence length, molecular weight, fraction of turn-forming residues, average hydropathicity, aliphatic index, absolute charge, secondary structures, hydrophobicity of exposed residues | Independent test: Accuracy: 0.74; MCC: 0.48 | Effective use of GBM for feature handling | Relies on manual feature selection, no structural data | Source codes are available at https://github.com/RedaRawi/PaRSnIP | Rawi et al. (2018) |
Classification models | DeepSol | E. coli; 69,420 (28,972+40,448) for training; 2,001 (1,000+1,001)³⁾ for testing | Convolutional neural network | Amino acid compositions, molecular weight, absolute charge, aliphatic index, average hydropathicity (GRAVY), fraction of turn-forming residues, secondary structures, fraction of exposed residues, and hydrophobicity of exposed residues | Independent test: Accuracy: 0.77; MCC⁴⁾: 0.55 | Automated feature learning with higher accuracy | Lacks structural data, limited for complex proteins | Source codes are available at https://zenodo.org/records/1162886 | Khurana et al. (2018) |
Classification models | SoluProt | E. coli; 11,436 (5,718+5,718) for training; 3,100 (1,550+1,550) for testing | Gradient boosting machine | Compositions (amino acids and dipeptides), physicochemical properties, average flexibility, secondary structure content, average disorder, residue content in transmembrane helices, maximum identity to E. coli proteins in PDB | Independent test: Accuracy: 0.59; MCC: 0.17 | Robust handling of noisy data, effective feature selection | Lower accuracy and MCC, less suited for high-precision tasks | Web-based tool and datasets are available at https://loschmidt.chemi.muni.cz/soluprot/ | Hon et al. (2021) |
Classification models | NetSolP | E. coli; 12,216 (66% + 34%)¹⁾ for training; 1,323 (620+703) for testing | Two input models, ESM1b and ProtT5 transformer-based models, with an output passed to a classification layer | Sequence embeddings (from transformer models like ESM1b and ProtT5), sequence profiles (MSAs generated using HHblits), and amino acid conservation (calculated using conservation scores) | Independent test: Accuracy: 0.76; MCC: 0.40 | Transformer-based model captures complex sequence-residue interactions | Moderate accuracy and MCC, computationally demanding | Source codes are available at https://github.com/TviNet/NetSolP-1.0 | Thumuluri et al. (2022) |
Classification models | PROTSOLM | E. coli; 64,598 (33,763+30,835) for training; 3,230 (1,675+1,555) for testing; 2,155 (951+1,204)⁶⁾ for testing; 1,784 (1,052+732)⁶⁾ for testing; 3,640 (1,817+1,823)⁶⁾ for testing | Multi-modal model: two input models, ESM2 for protein sequence embedding and equivariant graph neural networks (EGNNs) for structural feature encoding, combined with an output model of a deep learning classifier. Gradient boosting machine | Sequence embeddings (ESM2-650M), inter-residue distances, backbone geometry, physicochemical properties (charged residues, GRAVY index, and turn-forming residues), secondary structure content, solvent accessibility, hydrogen bond density, hydrophobicity of exposed residues, and structure confidence (pLDDT from ESMFold) | Independent test: Accuracy: 0.79; MCC: 0.58. Independent test: Accuracy: 0.60; MCC: 0.22. Independent test: Accuracy: 0.60; MCC: 0.23. Independent test: Accuracy: 0.60; MCC: 0.21 | Multimodal approach integrates sequence and structure data | Dependence on structural data may limit broad applicability | Source codes are available at https://github.com/tyang816/ProtSolM | Tan et al. (2024) |
Classification models | PLM_Sol | E. coli; 79,344 (47,291+32,053) for training; 4,000 (2,000+2,000) for testing | Two input embedding models, ProtT5 and ESM2, are combined with an output model of biLSTM_TextCNN layer | Protein sequence embeddings (capturing contextual information such as residue-level interactions and sequence structure) | Independent test: Accuracy: 0.72; MCC: 0.46 | Leverages PLMs for richer contextual embeddings | Computationally intensive, dependent on large datasets | Source codes are available at https://zenodo.org/records/12881509 | Zhang et al. (2024) |
Classification models | DeepSoluE | E. coli; 11,436 (5,718+5,718) for training; 3,100 (1,550+1,550) for testing | Long Short-Term Memory network | Physicochemical properties (isoelectric point, aromaticity, molecular weight, flexibility, and instability index), sequence embedding, and secondary structure content, along with structural-based features (protein sequence length, residue-level solvent accessibility, and torsion angle domain) | Independent test: Accuracy: 0.59; MCC: 0.18 | Balanced approach, integrates physicochemical features | Moderate accuracy and MCC, less suited for high-precision tasks | Web-based tool and datasets are available at http://lab.malab.cn/~wangchao/softs/DeepSoluE/ | Wang & Zou (2023) |
Regression models | SOLart | E. coli: 406⁵⁾ for training; E. coli: 550⁵⁾ for testing; S. cerevisiae: 59 and 50⁵⁾ for testing | Random forest | Compositions of amino acids, secondary structure content, protein length, protein solvent accessibility, and statistical potentials (residue-level solvent accessibility and torsion angle domain) | Independent tests: on E. coli, R²: 0.448, RMSE: 23%; on S. cerevisiae, R²: 0.608, 0.490, RMSE: 23%, 20% | Accurate for quantitative solubility predictions, strong cross-species performance | Limited by dependence on 3D structural data | Web-based tool is available at http://babylone.ulb.ac.be/SOLART | Hou et al. (2020) |
Regression models | SVR Model | E. coli: 3,148⁵⁾ for training; 4 proteins⁵⁾ for experimental validation | Support Vector Regression | Compositions of amino acids | Independent tests: on E. coli, R²: 0.57 | Efficient solubility optimization, successful experimental validation and versatile applicability | No internal test dataset and limited consideration of stability | Source codes are available at https://github.com/KangZhouGroupNUS/optimization_protein-solubility | Han et al. (2020) |
Regression models | GraphSol | E. coli: 2,052⁵⁾ for training; E. coli: 685⁵⁾ for testing; S. cerevisiae: 108⁵⁾ for testing | Graph convolutional network | Hidden Markov model, PSSM, diverse physicochemical properties (steric parameters, hydrophobicity, volume, polarizability, isoelectric point, etc.), relative solvent accessible surface area, backbone torsion angles, protein contact map, etc. | Independent tests: on E. coli, R²: 0.48; on S. cerevisiae, R²: 0.37 | Strong integration of sequence and structure | Dependent on structural data, limiting generalizability | Source codes are available at https://github.com/jcchan23/GraphSol | Chen et al. (2021) |
1) Only percentages or ratios have been reported.
2) The ratio of soluble and insoluble data has not been reported.
3) External qualitative solubility dataset from the study of Chang et al. (2014).
4) Matthews correlation coefficient (MCC), a balanced performance measure that remains informative for imbalanced datasets.
5) Quantitative solubility datasets for regression model training and testing.
6) External qualitative solubility datasets from Tan et al. (2024), Niwa et al. (2009), and Smialowski et al. (2012), respectively.
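
To illustrate why MCC is reported alongside accuracy in the table above, the following minimal Python sketch computes both metrics from a binary confusion matrix. The toy labels, the function names, and the convention of returning 0.0 when the MCC denominator is zero are assumptions made for illustration; they are not taken from any of the cited tools.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = soluble and 0 = insoluble."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical imbalanced toy set: 90 insoluble and 10 soluble proteins,
# scored by a trivial predictor that labels every protein as insoluble.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
toy_accuracy = accuracy(*counts)
toy_mcc = mcc(*counts)
print(f"accuracy = {toy_accuracy:.2f}, MCC = {toy_mcc:.2f}")
```

In this toy case the trivial predictor reaches a high accuracy while its MCC is zero, which is the behavior that motivates reporting MCC for imbalanced solubility datasets.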
Tool name | Number of features | Feature | URL | References |
---|---|---|---|---|
PROFEAT | < 2,000 | Residue compositions, physicochemical properties, sequence order and secondary structures, topological characteristics, interaction patterns, and other network properties | http://bidd2.nus.edu.sg/cgi-bin/profeat2016/main.cgi* | Zhang et al. (2017) |
iFeatureOmega | > 18,000 | Residue compositions, physicochemical properties, sequence order and secondary structures, half sphere exposure, residue depth, atom composition and network-based index | https://github.com/Superzchen/iFeatureOmega-CLI | Chen et al. (2022) |
protr | 22,700 | Residue compositions, physicochemical properties, secondary structure, similarity score, customizable descriptors (AAindex database), auxiliary functions | https://github.com/nanxstats/protr | Xiao et al. (2015) |
Rcpi | > 10,000 | Residue composition, physicochemical properties, secondary structures, PSSM profile, PCM, GO similarity, sequence similarity. Rcpi also provides compound-related features and protein-compound/protein-protein interaction features | https://github.com/nanxstats/Rcpi | Cao et al. (2015) |
Propy | 9,547 | Residue compositions, physicochemical properties, sequence order coupling numbers, pseudo-amino acid compositions | https://github.com/MartinThoma/propy3 | Cao et al. (2013) |
PDBparam | > 50 | Physicochemical properties, secondary structures, inter-residue interactions, identification of binding sites from PDB structure | https://www.iitm.ac.in/bioinfo/pdbparam/index.html | Nagarajan et al. (2016) |
POSSUM | 12,010 | PSSM-based features | https://possum.erc.monash.edu/ | Wang et al. (2017) |
Pfeature | 200,000+ | Diverse sequence-based features, binary profiles, evolutionary information based on PSSM, structural features, and pattern-based features | https://github.com/raghavagps/Pfeature | Pande et al. (2023) |
*Not accessible at the time of manuscript preparation.
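
As a rough illustration of the simplest descriptor families listed in the feature-tool table (residue compositions and a physicochemical property), the sketch below computes amino acid composition, dipeptide composition, and the GRAVY index in plain Python. It does not use the API of any tool in the table; the function names and the toy sequence are hypothetical, the input is assumed to contain only the 20 standard residues, and the hydropathy values are the Kyte-Doolittle scale.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale, used for the GRAVY index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aa_composition(seq):
    """Fraction of each of the 20 standard residues (20 features)."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each overlapping residue pair (400 features)."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {a + b: pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)}


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical toy sequence; assumed to contain only standard residues.
seq = "MKTAYIAKQR"
features = {
    **aa_composition(seq),
    **dipeptide_composition(seq),
    "GRAVY": gravy(seq),
}
print(len(features), round(features["GRAVY"], 2))
```

A feature vector of this kind (20 + 400 + 1 values per protein) is the sort of input that the composition-based models in the first table consume; the dedicated tools add many more descriptor families and handle non-standard residues, which this sketch does not.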