Introduction
Phosphorylation is the most essential post-translational modification in eukaryotes and plays a crucial role in a wide range of cellular processes. However, experiments for phosphorylation site discovery are time-consuming and expensive to perform. Computational prediction methods have therefore become popular as an important complementary approach in the study of protein phosphorylation sites. Prediction tools can be grouped into two categories: kinase-specific and non-kinase-specific. A kinase-specific prediction program requires as input both a protein sequence and the type of kinase, and produces some measure of the likelihood that each S/T/Y residue in the sequence is phosphorylated by the chosen kinase. In contrast, a non-kinase-specific prediction tool requires only a protein sequence as input, and reports the likelihood that each S/T/Y residue is phosphorylated by any possible kinase. Non-kinase-specific tools may be able to detect phosphorylation sites for which the associated kinase is unknown or has only a few known substrate sequences. With the development of sequencing technology, there is an increasing demand for non-kinase-specific tools, but the tools currently available are unsatisfactory in both quality and quantity. In this work, we developed a non-kinase-specific protein phosphorylation site prediction method that uses a support vector machine to integrate eight different sequence-level scores. These sequence-based features are Shannon entropy (SE), relative entropy (RE), predicted protein secondary structure (SS), predicted protein disorder (PD), accessible surface area (ASA), overlapping properties (OP), averaged cumulative hydrophobicity (ACH), and k-nearest neighbor (KNN). With carefully optimized parameters and sliding window size, our method achieved AUC values of 0.8405, 0.8183, and 0.7383 for serine (S), threonine (T), and tyrosine (Y) phosphorylation sites, respectively, in animals in ten-fold cross-validation.
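The sliding-window idea mentioned above can be illustrated with a minimal Python sketch. It only shows how candidate S/T/Y residues and their flanking windows might be extracted from a sequence, together with a toy per-window score. The function names, the padding symbol, the window half-width, and the composition-based entropy are all assumptions made for illustration; the feature definitions actually used by the method are not given in this section, and this score should not be read as the SE feature itself.

```python
import math
from collections import Counter

PAD = "-"  # hypothetical padding symbol for windows that run past the sequence ends


def candidate_windows(sequence, half_width=3, targets="STY"):
    """Yield (position, residue, window) for each candidate S/T/Y residue.

    The window has length 2 * half_width + 1 and is centered on the residue.
    """
    padded = PAD * half_width + sequence + PAD * half_width
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Index i in the sequence is index i + half_width in the padded string,
            # so the centered window starts at padded index i.
            yield i, residue, padded[i : i + 2 * half_width + 1]


def composition_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window.

    Toy stand-in feature: padding symbols are ignored.
    """
    residues = [r for r in window if r != PAD]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical example sequence, not taken from any dataset.
    for pos, res, win in candidate_windows("MKSPTAYRS", half_width=2):
        print(pos, res, win, round(composition_entropy(win), 3))
```

In a full pipeline, each window would be mapped to a vector of several such scores, and a classifier would be trained on those vectors separately for S, T, and Y sites.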
Citation: Y. Dou, B. Yao, C. Zhang. PhosphoSVM: Prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. Amino Acids (2014); DOI: 10.1007/s00726-014-1711-5