Primal sparse SVMs

Support vector machines (SVMs) are considered state of the art in machine learning due to their excellent performance in a wide range of applications. However, in many cases SVMs employ useless, redundant, or noisy input features, which can significantly degrade the resulting solution. This is because the SVM solution is formed as a combination of all input features, including the irrelevant ones.

To ensure that the SVM design employs only the most relevant features, it is common to apply a feature selection process, such as the Recursive Feature Elimination (RFE) algorithm, or, preferably, to change the formulation of the SVM so that it yields sparse solutions. Following this philosophy we find the L1 SVM, which replaces the quadratic regularization term with an L1 penalty, and the Doubly Regularized SVM (Dr-SVM), which includes both penalty terms.
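As a concrete illustration of these two sparse baselines (my own sketch, assuming scikit-learn; the paper does not prescribe any library), the snippet below fits an L1-penalized linear SVM and a Dr-SVM-style hinge model with an elastic-net penalty, then counts the weights that survive:

```python
# Sketch of the two penalized baselines described above, using scikit-learn
# (an illustrative choice, not the paper's implementation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# Synthetic data: 5 informative features buried among 20.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# L1 SVM: liblinear pairs the L1 penalty with the squared hinge loss.
l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1,
                   random_state=0).fit(X, y)

# Dr-SVM-style model: hinge loss with an elastic-net (L1 + L2) penalty.
drsvm = SGDClassifier(loss="hinge", penalty="elasticnet", l1_ratio=0.5,
                      alpha=0.01, max_iter=2000, random_state=0).fit(X, y)

print("L1-SVM non-zero weights:", np.count_nonzero(l1_svm.coef_))
print("elastic-net non-zero weights:", np.count_nonzero(drsvm.coef_))
```

With the L1 penalty, weights of irrelevant features are driven exactly to zero rather than merely shrunk, which is what makes the model sparse.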

Although these new versions retain most of the good properties of conventional SVMs, in certain situations they are not capable of providing the desired solution:

  1. If the correlation between relevant and irrelevant variables is not zero, the consistency of the solution cannot be ensured.
  2. If there are more features than available samples (d >> N), L1 regularization cannot select more than N variables.

Furthermore, these formulations are difficult to extend to the selection of predefined feature groups or to multiclass problems.
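Limitation 2 is easy to observe empirically. The sketch below (my own illustration, again assuming scikit-learn) fits an L1-penalized linear SVM with far more features than samples and counts the surviving weights:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# d >> N: 30 samples, 300 features.
N, d = 30, 300
X, y = make_classification(n_samples=N, n_features=d, n_informative=10,
                           n_redundant=0, random_state=0)

clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0,
                random_state=0).fit(X, y)

nnz = np.count_nonzero(clf.coef_)
print(f"non-zero weights: {nnz} (N = {N}, d = {d})")
# The L1 solution keeps at most on the order of N weights, so the vast
# majority of the 300 features are discarded regardless of relevance.
```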

This paper presents a new formulation of the linear SVM that directly forces the obtained solution to be sparse. To do so, rather than modifying the objective function with regularization terms, it adds a set of restrictions to the optimization problem that make it possible to automatically identify irrelevant features. These restrictions are based on the standard ε-insensitive loss used by regression SVMs.
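The ε-insensitive mechanism these restrictions borrow is the standard one from support vector regression, where deviations smaller than ε cost nothing. Written for a primal weight w_j rather than a prediction residual (notation mine, not necessarily that of [1]), it reads:

```latex
% Standard \varepsilon-insensitive function from SVR, applied here to a
% primal weight w_j instead of a prediction error:
\[
  |w_j|_{\varepsilon} \;=\; \max\bigl(0,\; |w_j| - \varepsilon\bigr)
\]
```

Intuitively, weights whose magnitude falls inside the ε tube incur no cost and can therefore be driven exactly to zero, which is what produces the sparse solution.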

This problem can easily be adapted to a new μ-SVM formulation, in which the parameter μ predefines the number of features that should remain in the final solution. These sparsity restrictions can be incorporated both into the standard L2 SVM and into the version with an L1 penalty term, giving rise to two versions of the algorithm: μ-L2-SVM and μ-L1-SVM.
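The role of μ can be pictured as choosing the ε threshold so that exactly μ weights lie outside the insensitive tube. The snippet below is purely an illustration of that correspondence on a fixed weight vector; the actual μ-SVM of the paper enforces this inside the optimization, not as a post-hoc thresholding step:

```python
import numpy as np

# Illustration only: given a dense weight vector, pick the epsilon that
# leaves exactly mu weights outside the insensitive tube. The mu-SVM
# formulation achieves this jointly during training, not afterwards.
w = np.array([0.05, -1.30, 0.02, 0.70, -0.01, 0.40])
mu = 2  # desired number of active features

magnitudes = np.sort(np.abs(w))[::-1]          # descending |w|
eps = magnitudes[mu]                           # the (mu+1)-th largest |w|
w_sparse = np.where(np.abs(w) > eps, w, 0.0)   # zero everything in the tube

print(w_sparse)  # only the mu largest-magnitude weights remain
```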

Comparing these methods against standard SVM versions and the RFE method, in terms of error and number of selected features, we observe that when the parameter μ is correctly set the proposed algorithm is able to eliminate all the irrelevant features from the final model, obtaining simpler models (with fewer variables) as well as more accurate solutions.
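For reference, the RFE baseline used in such comparisons can be run as follows (a generic scikit-learn sketch, not the paper's exact experimental setup). Unlike the μ formulations, RFE wraps a dense SVM and prunes features iteratively:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# RFE repeatedly refits the SVM and drops the lowest-|weight| features
# until the requested number of features remains.
selector = RFE(LinearSVC(dual=False, random_state=0),
               n_features_to_select=5, step=1).fit(X, y)

print("kept features:", np.flatnonzero(selector.support_))
```

Note that RFE needs the target feature count as an external input and retrains the SVM once per elimination step, whereas the μ-SVM obtains the sparse model in a single optimization.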

Finally, it is important to point out that, as described in detail in [1], the proposed formulation can be applied to the selection of individual features or, when necessary, to the selection of predefined feature groups or to multiclass problems, reaching in these cases conclusions similar to those of the binary case.


[1] Vanessa Gómez-Verdejo, Manel Martínez-Ramón, Jerónimo Arenas-García, Miguel Lázaro-Gredilla, Harold Molina-Bulla (2011). Support Vector Machines with constraints for sparsity in the primal parameters. IEEE Transactions on Neural Networks, 22: 1269-1283.