I am a senior scientist in the Machine Learning Group of Yahoo! Research. My research is mainly focused on the development of scalable kernel methods.
Partial citations of my papers in Computer Science Citeseer Research Index
Contact: selvarak at yahoo-inc dot com
Old webpage: http://guppy.mpe.nus.edu.sg/~mpessk/
2008 2007 2006 2005 2004 2003 2002 2001 2000 1999 and Earlier
In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an $\epsilon$-accurate solution in $O(\log (1/\epsilon))$ iterations. Experiments indicate that our method is much faster than state of the art solvers such as \Pegasos, \Tron, \svmperf, and a recent primal coordinate descent implementation.
A recent comparison of various algorithms for sequence labeling by Nguyen and Guo indicated that SVM-struct has much superior generalization performance than CRFs. In this short report we point out that the above difference mainly arises because that comparison employed different softwares that use different internal feature functions. When the two methods are compared using identical feature functions they do turn out to have quite close peak performance.
We propose and study algorithms using leave-one-out cross validation (LOO-CV) based predictive measures namely, LOO-CV error (LOO-CVE), Geisser's surrogate Predictive Probability (GPP) and Predictive Mean Squared Error (GPE) to select basis vectors for building sparse Gaussian process regression models. While the LOO-CVE measure uses only predictive mean information, the GPP and GPE measures use predictive variance information as well. All of these three LOO-CV based predictive measures can be used to find the number of basis vectors in the model automatically. The computational complexities of these measures are same as that of marginal likelihood and approximate log posterior probablity maximization approaches. We also give an efficient cache implementation for all the algorithms which gives similar or better generalization performance with lesser number of basis vectors. The experimental results on several real world benchmark datasets show better or comparable generalization performance over existing approaches.
We study a comprehensive framework for sponsored search which incorporates advertiser budgets, query frequency forecasts, and pricing and ranking schemes. We propose a linear program for optimizing revenue (or the total value to advertisers) that has an exponential number of variables; however, we describe how it can be solved efficiently using column generation. The formulation is easily extendable to various levels of problem complexity, adaptable to dynamic environments, fast, and works well in terms of practical considerations. Simulations show significant improvements in revenue and efficiency.
We present an algorithm for constructing an optimal slate of sponsored search advertisements which respects the ordering that is the outcome of a generalized second price auction, but which must also accommodate complicating factors such as overall budget constraints. The algorithm is easily fast enough to use on the fly for typical problem sizes, or as a subroutine in an overall optimization.
Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S^3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S^3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S^3VM algorithms is studied together, under a common experimental setting.
Large-scale logistic regression arises in many applications such document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also compare it with existing linear SVM implementations.
We consider the problem of utilizing unlabeled data for Gaussian process inference. Using a geometrically motivated data-dependent prior, we propose a graph-based construction of semi-supervised Gaussian processes. We demonstrate this approach empirically on several classification problems.
We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold cross-validation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations.
Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent prior for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm.
Semi-supervised SVMs (S3VMs) attempt to learn low-density separators by maximizing the margin over labeled and unlabeled examples. The associated optimization problem is non-convex. To examine the full potential of S3VMs modulo local minima problems in current implementations, we apply branch and bound techniques for obtaining exact, globally optimal solutions. Empirical evidence suggests that the globally optimal solution can return excellent generalization performance in situations where other implementations fail completely. While our current implementation is only applicable to small datasets, we discuss variants that can potentially lead to practically useful algorithms.
Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax) to approximate the SVM primal cost function well; (3) it is efficient and roughly scales as O(n dmax^2) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors.
In this paper, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The SMO algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on some benchmark and real-world data sets, including applications of ordinal regression to information retrieval, verify the usefulness of these approaches.See Wei Chu's homepage for a code.
In this paper we propose a fast incremental algorithm for designing linear regression models. The proposed algorithm generates a sparse model by optimizing multiple smoothing parameters using the generalized cross validation approach. The performances on synthetic and real-world datasets are compared with other incremental algorithms such as Tipping and Faul's fast Relevance Vector Machine, Chen et al's Orthogonal Least Squares and Orr's Regularized Forwards Selection. The results demonstrate that the proposed approach is competitive.
Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many Information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with possibly very large number of examples and features. At their core, our algorithms employ modified finite Newton techniques recently developed by Keerthi and DeCoste. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a mean field annealing (MFA) approach. This algorithm alleviates the problem of local minimum in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.
This is a revised version of the above SIGIR 2006 paper, with updated experimental results based on the SVMlin solver written by Vikas Sindhwani.
An intuitive approach to utilizing unlabeled data in kernel-based classification algorithms is to simply treat the unknown labels as additional optimization variables. For margin-based loss functions, one can view this approach as attempting to learn low-density separators. However, this is a hard optimization problem to solve in typical semi-supervised settings where unlabeled data is abundant. The popular Transductive SVM algorithm is a label-switching-retraining procedure that is known to be susceptible to local minima. In this paper, we present a global optimization framework for semi-supervised Kernel machines where an easier problem is parametrically deformed to the original hard problem and minimizers are smoothly tracked. Our approach is motivated from deterministic annealing techniques and involves a sequence of convex optimization problems that are exactly and efficiently solved. We present empirical results on several synthetic and real world datasets that demonstrate the effectiveness of our approach.
Sequential minimal optimization (SMO) is one popular algorithm for training support vector machine (SVM), but it still requires a large amount of computation time for solving large size problems. This paper proposes one parallel implementation of SMO for training SVM. The parallel SMO is developed using message passing interface (MPI). Specifically, the parallel SMO first partitions the entire training data set into smaller subsets and then simultaneously runs multiple CPU processors to deal with each of the partitioned data sets. Experiments show that there is great speedup on the adult data set and the Mixing National Institute of Standard and Technology (MNIST) data set when many processors are used. There are also satisfactory results on the Web data set.
This paper gives an efficient algorithm for tracking the solution curve of sparse logistic regression with respect to the L1 regularization parameter. The algorithm is based on approximating the logistic regression loss by a piecewise quadratic function, using Rosset and Zhu's path tracking algorithm on the approximate problem, and then applying a correction to get to the true path. Application of the algorithm to text classification and sparse kernel logistic regression shows that the algorithm is efficient.
This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways. Experiments indicate that the method is much faster than decomposition methods such as SVMlight, SMO and BSVM (e.g., 4-100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber's loss function and the L1 loss function, and also for solving ordinal regression.
In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions.See Wei Chu's homepage for a code.
This paper gives a new iterative algorithm for kernel logistic regression. It is based on the solution of a dual problem using ideas similar to those of the Sequential Minimal Optimization algorithm for Support Vector Machines. Asymptotic convergence of the algorithm is proved. Computational experiments show that the algorithm is robust and fast. The algorithmic ideas can also be used to give a fast dual algorithm for solving the optimization problem arising in the inner loop of Gaussian Process classifiers.
The Least Square Support Vector Machines (LS-SVM) formulation corresponds to the solution of a linear system of equations. Several approaches to its numerical solutions have been proposed in the literature. In this paper, we propose an improved method to the numerical solution of LS-SVM and show that the problem can be solved using one reduced system of linear equations. Compared with the existing algorithm of Suykens et al for LS-SVM, our approach is about twice as efficient. Numerical results using the proposed method are provided for comparisons with other existing algorithms.
In this paper, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The SMO algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on benchmark datasets verify the usefulness of these approaches.See Wei Chu's homepage for a code.
In this paper we generalize the LARS feature selection method to the linear SVM model, derive an efficient algorithm for it, and empirically demonstrate its usefulness as a feature selection tool for text classification.
Multiclass SVMs are usually implemented by combining several two-class SVMs. The one-versus-all method using winner-takes-all strategy and the one-versus-one method implemented by max-wins voting are popularly used for this purpose. In this paper we give empirical evidence to show that these methods are overall inferior to another one-versus-one method: one that uses Platt's posterior probabilities together with the pairwise coupling idea of Hastie and Tibshirani, especially when the training dataset is sparse.
In this paper, we use soft insensitive loss function in likelihood evaluation, and describe a Bayesian framework in a stationary Gaussian process. Bayesian methods are used to implement model adaptation, while keeping the merits of support vector regression, such as quadratic programming and sparseness. Moreover, confidence interval is provided in prediction.See Wei Chu's homepage for a code.
To be inserted.
In this paper we give an efficient method for computing the leave-one-out (LOO) error for support vector machines (SVMs) with Gaussian kernels quite accurately. It is particularly suitable for iterative decomposition methods of solving SVMs. The importance of various steps of the method is illustrated in detail by showing the performance on six benchmark datasets. The new method often leads to speedups of 10 to 50 times compared to standard LOO error computation. It has good promise for use in hyperparameter tuning and model comparison.
To be inserted.
This paper gives a new and efficient algorithm for the sparse logistic regression problem. The proposed algorithm is based on the Gauss-Seidel method and is asymptotically convergent. It is simple and extremely easy to implement; it neither uses any sophisticated mathematical programming software nor needs any matrix operations. It can be applied to a variety of real-world problems like identifying marker genes and building a classifier in the context of cancer diagnosis using microarray data.The gene selection method suggested in this paper is demonstrated on four real-world datasets and found to be consistent with the literature or biological knowledge of these sets.
Support vector machines (SVMs) with the Gaussian (RBF) kernel have been popular for practical use. Model selection in this class of SVMs involves two hyperparameters: the penalty parameter, C and the kernel width, sigma. This paper analyzes the behavior of the SVM classifier when these hyperparameters take very small or very large values. Our results help in a good understanding of the hyperparameter space that leads to an efficient heuristic method of searching for hyperparameter values with small generalization errors.One of the key theoretical results of this paper is that, as sigma goes to infinity the SVM classifier with gaussian kernel tends to the linear SVM classifier. This is an important result since it means that, a gaussian SVM designed to minimize the generalization error over all possible C and sigma has to be equal or better in performance than a linear SVM classifier with C tuned.
In this paper, we propose a multi-category classification method that combines binary classifiers through soft-max function. Posteriori probabilities are also obtained. Both, one-versus-all and one-versus-one classifiers can be used in the combination. Empirical comparison shows that the proposed method is competitive with other implementations of one-versus-all and one-versus-one methods in terms of both classification accuracy and posteriori probability estimate.
This paper describes Bayesian techniques for support vector classification. In particular, we propose a novel differentiable loss function called the trigonometric loss function which has the desirable characteristic of natural normalization in the likelihood function, and then follow standard Gaussian processes techniques to set up a Bayesian framework. In this framework, Bayesian inference is used to implement model adaptation, while keeping the merits of support vector classifier, such as sparseness and convex programming. This differs from standard Gaussian processes for classification. Moreover, we put forward class probability in making predictions. Experimental results on benchmark data sets indicate the usefulness of this approach.For a code, download this directory: bisvm.zip.
This paper extends the well-known SMO algorithm of Support Vector Machines (SVMs) to Least Squares SVM formulations which include LS-SVM classification, Kernel Ridge Regression and a particular form of regularized Kernel Fisher Discriminant. The algorithm is shown to be asymptotically convergent. It is also extremely easy to implement. Computational experiments show that the algorithm is fast and scales efficiently (quadratically) as a function of the number of examples.
Choosing optimal hyperparameter values for support vector machines is an important step in SVM design. This is usually done by minimizing either an estimate of generalization error or some other related performance measure. In this paper, we empirically study the usefulness of several simple performance measures that are inexpensive to compute (in the sense that they do not require expensive matrix operations involving the kernel matrix). The results point out which of these measures are adequate functionals for tuning SVM hyperparameters. For SVMs with L1 soft margin formulation, none of the simple measures yields a performance uniformly as good as k-fold cross validation; Joachims' Xi-Alpha bound and Wahba et al's GACV come next and perform reasonably well. For SVMs with L2 soft margin formulation, the radius margin bound gives a very good prediction of optimal hyperparameter values.Follow this link for details: comparison.html.
To be inserted.
In collaboration with DTI Institute and Genome Institute of Singapore our Datamining group took part in Task 1 (Information extraction from Biomedical articles) of KDD Cup 2002 and got a honourable mention. Click here for a short article written for the university monthly magazine. For a more detailed article that was published in SIGKDD Explorations click here.
In this short report I discuss implementation issues related to the tuning of the hyperparameters of a Support Vector Machine (SVM) with L2 soft margin, for which the radius/margin bound is taken as the index to be minimized and iterative techniques are employed for computing radius and margin. The implementation is shown to be feasible and efficient even for large problems having more than 10,000 support vectors.A report discussing the implementation issues, together with an old code running on Matlab 5.3 interface, can be downloaded from the following page: nparm.html.
Convergence of a generalized version of the modified SMO algorithms given by Keerthi et al for SVM classifier design is proved. The convergence results are also extended to modified SMO algorithms for solving nu-SVM classifier problems.There is a slight flaw in the final argument in the convergence proof of the above paper. This has been corrected by Takahashi and Nishi (IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp. 774-776, May 2005).
This paper gives a new iterative algorithm for kernel logistic regression. It is based on the solution of the dual problem using ideas similar to those of the SMO algorithm for Support Vector Machines. Asymptotic convergence of the algorithm is proved. Computational experiments show that the algorithm is robust and fast. The algorithmic ideas can also be used to give a fast dual algorithm for solving the optimization problem arising in the inner loop of Gaussian Process classifiers.
In this paper we use a unified loss function, called the soft insensitive loss function, for Bayesian support vector regression. We follow standard Gaussian processes for regression to set up the Bayesian framework, in which the unified loss function is used in the likelihood evaluation. Under this framework, the maximum a posteriori estimate of the function values corresponds to the solution of an extended support vector regression problem. The overall approach has the merits of support vector regression such as convex quadratic programming and sparsity in solution representation. It also has the advantages of Bayesian methods for model adaptation and error bars for its predictions. Experimental results on simulated and real-world data sets indicate that the approach works well even on large data sets.
Choosing optimal hyperparameter values for support vector machines is an important step in SVM design. This is usually done by minimizing either an estimate of generalization error or some other related performance measure. In this paper, we empirically study the usefulness of several simple performance measures that are inexpensive to compute (in the sense that they do not require expensive matrix operations involving the kernel matrix). The results point out which of these measures are adequate functionals for tuning SVM hyperparameters. For SVMs with L1 soft margin formulation, none of the simple measures yields a performance uniformly as good as k-fold cross validation; Joachims' Xi-Alpha bound and Wahba et al's GACV come next and perform reasonably well. For SVMs with L2 soft margin formulation, the radius margin bound gives a very good prediction of optimal hyperparameter values.Follow this link for details: comparison.html.
The chief aim of this paper is to propose mean-field approximations for a broad class of Belief networks, of which sigmoid and noisy-or networks can be seen as special cases. The approximations are based on a powerful mean-field theory suggested by Plefka. We show that Saul, Jaakkola and Jordan' s approach is the first order approximation in Plefka's approach, via a variational derivation. The application of Plefka's theory to belief networks is not computationally tractable. To tackle this problem we propose new approximations based on Taylor series. Small scale experiments show that the proposed schemes are attractive.
To be inserted.
This paper points out an important source of confusion and inefficiency in Platt's Sequential Minimal Optimization (SMO) algorithm that is caused by the use of a single threshold value. Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO. These modified algorithms perform significantly faster than the original SMO on all benchmark datasets tried.The modified SMO algorithm described in the above paper has been popularly used. For example, the LIBSVM code uses a variation of this algorithm.
Click here for a postscript file containing the original technical report with pseudocode.
Gaussian Processes are powerful non-parametric models specified by parameterized mean and covariance functions. For regression they are simpler and better than Neural Networks. Standard approaches to choose hyperparameters in Gaussian Processes are Maximum Likelihood (ML) and Maximum APosterior (MAP) approaches. Recently we have proposed and investigated predictive approaches based on Geisser's Predictive Sample Reuse (PSR) methodology and the related Stone's Cross Validation (CV) methodology. More specifically, we have derived results for Geisser's surrogate Predictive Probability (GPP), Geisser's Predictive mean square Error (GPE) and the standard CV error and have made a comparative study. Within an approximation we have also arrived at the Generalized Cross Validation (GCV) and established its relationship with the GPP and GPE approaches. These approaches have been tested on a number of benchmark problems. Experimental results show that these approaches are strongly competitive to the existing approaches.
To be inserted.
Gaussian Processes are powerful non-parametric models specified by parameterized mean and covariance functions. For regression they are simpler and better than Neural Networks. Standard approaches to choose hyperparameters in Gaussian Processes are Maximum Likelihood (ML) and Maximum APosterior (MAP) approaches. Recently we have proposed and investigated predictive approaches based on Geisser's Predictive Sample Reuse (PSR) methodology and the related Stone's Cross Validation (CV) methodology. More specifically, we have derived results for Geisser's surrogate Predictive Probability (GPP), Geisser's Predictive mean square Error (GPE) and the standard CV error and have made a comparative study. Within an approximation we have also arrived at the Generalized Cross Validation (GCV) and established its relationship with the GPP and GPE approaches. These approaches have been tested on a number of benchmark problems. Experimental results show that these approaches are strongly competitive to the existing approaches.
To be inserted.
In this report we give a new, fast iterative algorithm for support vector machine (SVM) classifier design. The basic problem treated is one that does not allow classification violations. The problem is converted to a problem of computing the nearest point between two convex polytopes. The suitability of two classical nearest point algorithms, due to Gilbert, and Mitchell, Dem'yanov and Malozemov, is studied. Ideas from both these algorithms are combined and modified to derive our fast algorithm. For problems which require classification violations to be allowed, the violations are quadratically penalized and an idea due to Friess is used to convert it to a problem in which there are no classification violations. Comparative computational evaluation of our algorithm against powerful SVM methods such as Platt's Sequential Minimal Optimization shows that our algorithm is very competitive.Click here for a postscript file containing the original technical report with pseudocode.
Click here for a fortran code.
In this paper, a stochastic connectionist approach is proposed for solving function optimization problems with real-valued parameters. With the assumption of increased processing capability of a node in the connectionist network, we show how a broader class of problems can be solved. As the proposed approach is a stochastic search technique, it avoids getting stuck in local optima. Robustness of the approach is demostrated on several multi-modal functions with different numbers of variables. Optimization of a well-known partitional clustering criterion, the squared-error criterion (SEC), is formulated as a function optimization problem and is solved using the proposed approach. This approach is used to cluster selected data sets and the results obtained are compared with that of the K-means algorithm and a simulated annealing (SA) approach. The amenability of the connectionist approach to parallelization enables effective use of parallel hardware.
This paper points out an important source of inefficiency in Smola and Scholkopf's Sequential Minimal Optimization (SMO) algorithm for SVM regression that is caused by the use of a single threshold value. Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO for regression. These modified algorithms perform significantly faster than the original SMO on the datasets tried.
An alternate derivation of Plefka's method is presented which is obtained from minimizing Gibbs energy. It is shown that this method can be rederived using a perturbation expansion of Kullback-Leibler divergence, the latter having a nice information geometric interpretation.
Last updated: July 31, 2006