Protein solubility can be a decisive factor in both research and production efficiency, and in silico sequence-based predictors that accurately estimate solubility outcomes are highly sought after. The goal of this project is to build classifiers that use sequence and structural features to predict recombinant protein solubility. For this purpose, we propose two methods: one using gradient boosting machines (GBMs) and another using deep convolutional neural networks (CNNs).
Parsnip, the GBM-based method, achieved a state-of-the-art accuracy of 0.74 and a Matthews correlation coefficient (MCC) of 0.48 for protein solubility prediction. It also revealed that tripeptide stretches containing multiple histidines tend to correlate negatively with solubility.
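For readers unfamiliar with the two metrics reported here, the following minimal sketch shows how accuracy and MCC are computed from the counts of a binary confusion matrix (soluble vs. insoluble). The function name and the example counts are hypothetical and chosen for illustration only; they are not results from this work.

```python
import math

def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

# Hypothetical counts, for illustration only.
acc, mcc = accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10)
print(f"accuracy={acc:.2f}, MCC={mcc:.2f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the soluble and insoluble classes are imbalanced.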
DeepSol, the CNN-based method, exploits k-mer structure using multiple convolution filters of varying lengths, achieving an accuracy of 0.77 and an MCC of 0.55. DeepSol is currently the state of the art in sequence-based protein solubility prediction.
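The idea of scanning a sequence with convolution filters of several lengths can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the DeepSol architecture: the filter widths (2, 3, 5), the number of filters per width, the random untrained weights, the example sequence, and all function names are hypothetical. Each filter of width k responds to k-mers, and global max pooling keeps its strongest response anywhere in the sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]

def conv_max(x, filt):
    """Slide one filter (k rows x 20 columns) along x, apply ReLU,
    and return the maximum activation (global max pooling)."""
    k = len(filt)
    best = 0.0  # starting at 0 acts as the ReLU floor
    for start in range(len(x) - k + 1):
        act = sum(
            filt[j][a] * x[start + j][a]
            for j in range(k)
            for a in range(len(AMINO_ACIDS))
        )
        best = max(best, act)
    return best

def random_filter(rng, k):
    """Untrained filter with Gaussian weights (assumption: illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS] for _ in range(k)]

rng = random.Random(0)
widths = (2, 3, 5)       # hypothetical k-mer lengths
filters_per_width = 4    # hypothetical number of filters per length
filters = [random_filter(rng, k) for k in widths for _ in range(filters_per_width)]

x = one_hot("MKHHHAELVKR")  # hypothetical example sequence
features = [conv_max(x, f) for f in filters]
print(len(features))  # one pooled feature per filter
```

In a trained network the filter weights would be learned, and the concatenated pooled features would feed into further layers that output a solubility prediction.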