I am a principal scientist in the Cloud and Information Services Lab (CISL, pronounced as sizzle), a new applied science group in Microsoft with Raghu Ramakrishnan as the head. I am located in Mountain View, CA. From Jan 2004-Apr 2012 I was with the Machine Learning Group of Yahoo! Research, in Santa Clara, CA. My recent research has mainly focused on the development of scalable algorithms for large margin methods including those for structured outputs, and information extraction.
Contact: keerthi at microsoft dot com
Citations of my papers in Microsoft Academic Search
Citations of my papers in Google Scholar
2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999 and Earlier
Transductive SVM (TSVM) is a well known semi-supervised large margin learning method for binary text classification. In this paper we extend this method to multi-class and hierarchical classification problems. We point out that the determination of labels of unlabeled examples with fixed classifier weights is a linear programming problem. We devise an efficient technique for solving it. The method is applicable to general loss functions. We demonstrate the value of the new method using large margin loss on a number of multiclass and hierarchical classification datasets. For maxent loss we show empirically that our method is better than expectation regularization/constraint and posterior regularization methods, and competitive with the version of entropy regularization method which uses label constraints.
Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing. However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. We show the efficiency of these proposed algorithms for five NLP tagging tasks. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster than the state-of-the-art algorithm for tagging tasks with a large number of labels. This algorithm makes real-time large-scale tagging applications with thousands of labels feasible.
We consider the problem of extracting, in a domain-centric fashion, a given set of attributes from a large number of semi-structured websites. Previous approaches [7, 5] to solve this problem are based on page level inference. We propose a distinct new approach that directly chooses attribute extractors for a site using a scoring mechanism that is designed at the domain level via simple classification methods using a training set from a small number of sites. To keep the number of candidate extractors in each site manageably small we use two observations that hold in most domains: (a) imprecise annotators can be used to identify a small set of candidate extractors for a few attributes (anchors); and (b) non-anchor attributes lie in close proximity to the anchor attributes. Experiments on three domains (Events, Books and Restaurants) show that our approach is very effective in spite of its simplicity.
In this paper we propose a new approach for semi-supervised structured output learning. Our approach uses relaxed labeling on unlabeled data to deal with the combinatorial nature of the label space and further uses domain constraints to guide the learning. Since the overall objective is non-convex, we alternate between the optimization of the model parameters and the label distribution of unlabeled data. The alternating optimization coupled with deterministic annealing helps us achieve better local optima and as a result our approach leads to better constraint satisfaction during inference. Experimental results on sequence labeling benchmarks show superior performance of our approach compared to Constraint Driven Learning (CoDL) and Posterior Regularization (PR).
We consider the problem of learning structured output probabilistic models with training examples having partial labels. Partial label scenarios arise commonly in web applications such as taxonomy (hierarchical) classification, multi-label classification and information extraction from web pages. For example, label information may be available only at the internal node level (not at the leaf level) for some pages in a taxonomy classification problem. In a multi-label classification problem, it may be available only for some of the classes (in each example). Similarly, in a sequence learning problem, we may have label information only for some nodes in the training sequences. Conventionally, marginal likelihood maximization technique has been used to solve these problems. In such a solution unlabeled examples and any side information like expected label distribution (or correlation in a multi-label setting) of the unlabeled part are not used. We solve these problems by incorporating entropy and label distribution or correlation regularizations along with marginal likelihood. Entropy and label distribution regularizations have been used previously in semi-supervised learning with fully unlabeled examples. In this paper we develop probabilistic taxonomy and multi-label classifier models, and provide the ideas needed for expanding their usage to the partial label scenario. Experiments on real-life taxonomy and multilabel learning problems show that significant improvements in accuracy are achieved by incorporating these regularizations, when most of the examples are only partially labeled.
In many real world prediction problems the output is a structured object like a sequence or a tree or a graph. Such problems range from natural language processing to computational biology or computer vision and have been tackled using algorithms, referred to as structured output learning algorithms. We consider the problem of structured classification. In the last few years, large margin classifiers like support vector machines (SVMs) have shown much promise for structured output learning. The related optimization problem is a convex quadratic program (QP) with a large number of constraints, which makes the problem intractable for large data sets. This paper proposes a fast sequential dual method (SDM) for structural SVMs. The method makes repeated passes over the training set and optimizes the dual variables associated with one example at a time. The use of additional heuristics makes the proposed method more efficient. We present an extensive empirical evaluation of the proposed method on several sequence learning problems. Our experiments on large data sets demonstrate that the proposed method is an order of magnitude faster than state of the art methods like cutting-plane method and stochastic gradient descent method (SGD). Further, SDM reaches steady state generalization performance faster than the SGD method. The proposed SDM is thus a useful alternative for large scale structured output learning.
Extracting information from web pages is an important problem; it has several applications such as providing improved search results and construction of databases to serve user queries. In this paper we propose a novel structured prediction method to address two important aspects of the extraction problem: (1) labeled data is available only for a small number of sites and (2) a machine learned global model does not generalize adequately well across many websites. For this purpose, we propose a weight space based graph regularization method. This method has several advantages. First, it can use unlabeled data to address the limited labeled data problem and falls in the class of graph regularization based semi-supervised learning approaches. Second, to address the generalization inadequacy of a global model, this method builds a local model for each website. Viewing the problem of building a local model for each website as a task, we learn the models for a collection of sites jointly; thus our method can also be seen as a graph regularization based multi-task learning approach. Learning the models jointly with the proposed method is very useful in two ways: (1) learning a local model for a website can be effectively influenced by labeled and unlabeled data from other websites; and (2) even for a website with only unlabeled examples it is possible to learn a decent local model. We demonstrate the efficacy of our method on several real-life data; experimental results show that significant performance improvement can be obtained by combining semi-supervised and multi-task learning in a single framework.
A large fraction of binary classification problems arising in web applications are of the type where the positive class is well defined and compact while the negative class comprises everything else in the distribution for which the classifier is developed; it is hard to represent and sample from such a broad negative class. Classifiers based only on positive and unlabeled examples reduce human annotation effort significantly by removing the burden of choosing a representative set of negative examples. Various methods have been proposed in the literature for building such classifiers. Of these, the state of the art methods are Biased SVM and Elkan & Noto's methods. While these methods often work well in practice, they are computationally expensive since hyperparameter tuning is very important, particularly when the size of labeled positive examples set is small and class imbalance is high. In this paper we propose a pairwise ranking based approach to learn from positive and unlabeled examples (LPU) and we give a theoretical justification for it. We present a pairwise RankSVM (RSVM) based method for our approach. The method is simple, efficient, and its hyperparameters are easy to tune. A detailed experimental study using several benchmark datasets shows that the proposed method gives competitive classification performance compared to the mentioned state of the art methods, while training 3-10 times faster. We also propose an efficient AUC based feature selection technique in the LPU setting and demonstrate its usefulness on the datasets. To get an idea of the goodness of the LPU methods we compare them against supervised learning (SL) methods that also make use of negative examples in training. SL methods give a slightly better performance than LPU methods when there is a rich set of negative examples; however, they are inferior when the number of negative training examples is not large enough.
In the design of practical web page classification systems one often encounters a situation in which the labeled training set is created by choosing some examples from each class; but, the class proportions in this set are not the same as those in the test distribution to which the classifier will be actually applied. The problem is made worse when the amount of training data is also small. In this paper we ex- plore and adapt binary SVM methods that make use of un- labeled data from the test distribution, viz., Transductive SVMs (TSVMs) and expectation regularization/constraint (ER/EC) methods to deal with this situation. We empir- ically show that when the labeled training data is small, TSVM designed using the class ratio tuned by minimizing the loss on the labeled set yields the best performance; its performance is good even when the deviation between the class ratios of the labeled training set and the test set is quite large. When the labeled training data is sufficiently large, an unsupervised Gaussian mixture model can be used to get a very good estimate of the class ratio in the test set; also, when this estimate is used, both TSVM and EC/ER give their best possible performance, with TSVM coming out superior. The ideas in the paper can be easily extended to multi-class SVMs and MaxEnt models.
In this paper we provide a principled approach to solve a transductive classification problem involving a similar graph (edges tend to connect nodes with same labels) and a dissimilar graph (edges tend to connect nodes with opposing labels). Most of the existing methods, e.g., Information Regularization (IR), Weighted vote Relational Neighbor classifier (WvRN) etc, assume that the given graph is only a similar graph. We extend the IR and WvRN methods to deal with mixed graphs. We evaluate the proposed extensions on several benchmark datasets as well as two real world datasets and demonstrate the usefulness of our ideas.
Talk that descibes approaches to large scale information extraction from the web.
RankSVM (Herbrich et al, 2000; Joachims, 2002) is a pairwise method for designing ranking models. SVMLight is the only publicly available software for RankSVM. It is slow and, due to incomplete training with it, previous evaluations show RankSVM to have inferior ranking performance. We propose new methods based on primal Newton method to speed up RankSVM training and show that they are 5 orders of magnitude faster than SVMLight. Evaluation on the Letor benchmark datasets after complete training using such methods shows that the performance of RankSVM is excellent.
In this paper we consider the problem of collectively classifying entities where relational information is available across the entities. In practice inaccurate class distribution for each entity is often available from another (external) classifier. For example this distribution could come from a classifier built using content features or a simple dictionary. Given the relational and inaccurate external classifier information, we consider two graph based settings in which the problem of collective classification can be solved. In the first setting the class distribution is used to fix labels to a subset of nodes and the labels for the remaining nodes are obtained like in a transductive setting. In the other setting the class distributions of all nodes are used to define the fitting function part of a graph regularized objective function. We define a generalized objective function that handles both the settings. Methods like harmonic Gaussian field and local-global consistency (LGC) reported in the literature can be seen as special cases. We extend the LGC and weighted vote relational neighbor classification (WvRN) methods to support usage of external classifier information. We also propose an educient least squares regularization (LSR) based method and relate it to information regularization methods. All the methods are evaluated on several benchmark and real world datasets. Considering together speed, robustness and accuracy, experimental results indicate that the LSR and WvRN-extension methods perform better than other methods.
In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such large-scale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1- and L2-loss functions. The proposed method is simple and reaches an $\epsilon$-accurate solution in $O(\log (1/\epsilon))$ iterations. Experiments indicate that our method is much faster than state of the art solvers such as \Pegasos, \Tron, \svmperf, and a recent primal coordinate descent implementation.
Efficient training of direct multi-class formulations of linear Support Vector Machines is very useful in applications such as text classification with a huge number examples as well as features. This paper presents a fast dual method for this training. The main idea is to sequentially traverse through the training set and optimize the dual variables associated with one example at a time. The speed of training is enhanced further by shrinking and cooling heuristics. Experiments indicate that our method is much faster than state of the art solvers such as bundle, cutting plane and exponentiated gradient methods.
Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S^3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S^3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S^3VM algorithms is studied together, under a common experimental setting.
We consider feature selection in a multi-class setting where the goal is to find a small set of features for all the classes simultaneously. We develop an embedded method for this problem using the idea of scaling factors. Training involves the solution of a convex program for which we give a scalable algorithm. The method is closely related to extensions of L1 regularization and recursive feature elimination. These methods are shown to be very ective in text classification.
We study a comprehensive framework for sponsored search which incorporates advertiser budgets, query frequency forecasts, and pricing and ranking schemes. We propose a linear program for optimizing revenue (or the total value to advertisers) that has an exponential number of variables; however, we describe how it can be solved efficiently using column generation. The formulation is easily extendable to various levels of problem complexity, adaptable to dynamic environments, fast, and works well in terms of practical considerations. Simulations show significant improvements in revenue and efficiency.
Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM).
The PASCAL challenge was conducted in Feb-June 2008 to evaluate the scalability and efficiency of existing ML approaches with respect to computational, memory or communication resources. This set of slides describes our participation using some SVM techniques that we developed.
In this paper we consider the problem of Gaussian process classifier (GPC) model selection with different Leave-One-Out (LOO) Cross Validation (CV) based optimization criteria and provide a practical algorithm using LOO predictive distributions with such criteria to select hyperparameters. Apart from the standard average negative logarithm of predictive probability (NLP), we also consider smoothed versions of criteria such as F-measure and Weighted Error Rate (WER), which are useful for handling imbalanced data. Unlike the regression case, LOO predictive distributions for the classifier case are intractable. We use approximate LOO predictive distributions arrived from Expectation Propagation (EP) approximation. We conduct experiments on several real world benchmark datasets. When the NLP criterion is used for optimizing the hyperparameters, the predictive approaches show better or comparable NLP generalization performance with existing GPC approaches. On the other hand, when the F-measure criterion is used, the F-measure generalization performance improves significantly on several datasets. Overall, the EP-based predictive algorithm comes out as an excellent choice for GP classifier model selection with different optimization criteria.
A recent comparison of various algorithms for sequence labeling by Nguyen and Guo indicated that SVM-struct has much superior generalization performance than CRFs. In this short report we point out that the above difference mainly arises because that comparison employed different softwares that use different internal feature functions. When the two methods are compared using identical feature functions they do turn out to have quite close peak performance.
We propose and study algorithms using leave-one-out cross validation (LOO-CV) based predictive measures namely, LOO-CV error (LOO-CVE), Geisser's surrogate Predictive Probability (GPP) and Predictive Mean Squared Error (GPE) to select basis vectors for building sparse Gaussian process regression models. While the LOO-CVE measure uses only predictive mean information, the GPP and GPE measures use predictive variance information as well. All of these three LOO-CV based predictive measures can be used to find the number of basis vectors in the model automatically. The computational complexities of these measures are same as that of marginal likelihood and approximate log posterior probablity maximization approaches. We also give an efficient cache implementation for all the algorithms which gives similar or better generalization performance with lesser number of basis vectors. The experimental results on several real world benchmark datasets show better or comparable generalization performance over existing approaches.
We present an algorithm for constructing an optimal slate of sponsored search advertisements which respects the ordering that is the outcome of a generalized second price auction, but which must also accommodate complicating factors such as overall budget constraints. The algorithm is easily fast enough to use on the fly for typical problem sizes, or as a subroutine in an overall optimization.
Large-scale logistic regression arises in many applications such document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also compare it with existing linear SVM implementations.
We consider the problem of utilizing unlabeled data for Gaussian process inference. Using a geometrically motivated data-dependent prior, we propose a graph-based construction of semi-supervised Gaussian processes. We demonstrate this approach empirically on several classification problems.
We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold cross-validation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations.
Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent prior for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm.
Semi-supervised SVMs (S3VMs) attempt to learn low-density separators by maximizing the margin over labeled and unlabeled examples. The associated optimization problem is non-convex. To examine the full potential of S3VMs modulo local minima problems in current implementations, we apply branch and bound techniques for obtaining exact, globally optimal solutions. Empirical evidence suggests that the globally optimal solution can return excellent generalization performance in situations where other implementations fail completely. While our current implementation is only applicable to small datasets, we discuss variants that can potentially lead to practically useful algorithms.
Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax) to approximate the SVM primal cost function well; (3) it is efficient and roughly scales as O(n dmax^2) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors.
In this paper, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The SMO algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on some benchmark and real-world data sets, including applications of ordinal regression to information retrieval, verify the usefulness of these approaches.See Wei Chu's homepage for a code.
In this paper we propose a fast incremental algorithm for designing linear regression models. The proposed algorithm generates a sparse model by optimizing multiple smoothing parameters using the generalized cross validation approach. The performances on synthetic and real-world datasets are compared with other incremental algorithms such as Tipping and Faul's fast Relevance Vector Machine, Chen et al's Orthogonal Least Squares and Orr's Regularized Forwards Selection. The results demonstrate that the proposed approach is competitive.
Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many Information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with possibly very large number of examples and features. At their core, our algorithms employ modified finite Newton techniques recently developed by Keerthi and DeCoste. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a mean field annealing (MFA) approach. This algorithm alleviates the problem of local minimum in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.
This is a revised version of the above SIGIR 2006 paper, with updated experimental results based on the SVMlin solver written by Vikas Sindhwani.
An intuitive approach to utilizing unlabeled data in kernel-based classification algorithms is to simply treat the unknown labels as additional optimization variables. For margin-based loss functions, one can view this approach as attempting to learn low-density separators. However, this is a hard optimization problem to solve in typical semi-supervised settings where unlabeled data is abundant. The popular Transductive SVM algorithm is a label-switching-retraining procedure that is known to be susceptible to local minima. In this paper, we present a global optimization framework for semi-supervised Kernel machines where an easier problem is parametrically deformed to the original hard problem and minimizers are smoothly tracked. Our approach is motivated from deterministic annealing techniques and involves a sequence of convex optimization problems that are exactly and efficiently solved. We present empirical results on several synthetic and real world datasets that demonstrate the effectiveness of our approach.
Sequential minimal optimization (SMO) is one popular algorithm for training support vector machine (SVM), but it still requires a large amount of computation time for solving large size problems. This paper proposes one parallel implementation of SMO for training SVM. The parallel SMO is developed using message passing interface (MPI). Specifically, the parallel SMO first partitions the entire training data set into smaller subsets and then simultaneously runs multiple CPU processors to deal with each of the partitioned data sets. Experiments show that there is great speedup on the adult data set and the Mixing National Institute of Standard and Technology (MNIST) data set when many processors are used. There are also satisfactory results on the Web data set.
This paper gives an efficient algorithm for tracking the solution curve of sparse logistic regression with respect to the L1 regularization parameter. The algorithm is based on approximating the logistic regression loss by a piecewise quadratic function, using Rosset and Zhu's path tracking algorithm on the approximate problem, and then applying a correction to get to the true path. Application of the algorithm to text classification and sparse kernel logistic regression shows that the algorithm is efficient.
This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways. Experiments indicate that the method is much faster than decomposition methods such as SVMlight, SMO and BSVM (e.g., 4-100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber's loss function and the L1 loss function, and also for solving ordinal regression.
In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions.See Wei Chu's homepage for a code.
This paper gives a new iterative algorithm for kernel logistic regression. It is based on the solution of a dual problem using ideas similar to those of the Sequential Minimal Optimization algorithm for Support Vector Machines. Asymptotic convergence of the algorithm is proved. Computational experiments show that the algorithm is robust and fast. The algorithmic ideas can also be used to give a fast dual algorithm for solving the optimization problem arising in the inner loop of Gaussian Process classifiers.
The Least Square Support Vector Machines (LS-SVM) formulation corresponds to the solution of a linear system of equations. Several approaches to its numerical solutions have been proposed in the literature. In this paper, we propose an improved method to the numerical solution of LS-SVM and show that the problem can be solved using one reduced system of linear equations. Compared with the existing algorithm of Suykens et al for LS-SVM, our approach is about twice as efficient. Numerical results using the proposed method are provided for comparisons with other existing algorithms.
In this paper, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The SMO algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on benchmark datasets verify the usefulness of these approaches.See Wei Chu's homepage for a code.
In this paper we generalize the LARS feature selection method to the linear SVM model, derive an efficient algorithm for it, and empirically demonstrate its usefulness as a feature selection tool for text classification.
Multiclass SVMs are usually implemented by combining several two-class SVMs. The one-versus-all method using winner-takes-all strategy and the one-versus-one method implemented by max-wins voting are popularly used for this purpose. In this paper we give empirical evidence to show that these methods are overall inferior to another one-versus-one method: one that uses Platt's posterior probabilities together with the pairwise coupling idea of Hastie and Tibshirani, especially when the training dataset is sparse.
In this paper, we use soft insensitive loss function in likelihood evaluation, and describe a Bayesian framework in a stationary Gaussian process. Bayesian methods are used to implement model adaptation, while keeping the merits of support vector regression, such as quadratic programming and sparseness. Moreover, confidence interval is provided in prediction.See Wei Chu's homepage for a code.
To be inserted.
In this paper we give an efficient method for computing the leave-one-out (LOO) error for support vector machines (SVMs) with Gaussian kernels quite accurately. It is particularly suitable for iterative decomposition methods of solving SVMs. The importance of various steps of the method is illustrated in detail by showing the performance on six benchmark datasets. The new method often leads to speedups of 10 to 50 times compared to standard LOO error computation. It has good promise for use in hyperparameter tuning and model comparison.
To be inserted.
This paper gives a new and efficient algorithm for the sparse logistic regression problem. The proposed algorithm is based on the Gauss-Seidel method and is asymptotically convergent. It is simple and extremely easy to implement; it neither uses any sophisticated mathematical programming software nor needs any matrix operations. It can be applied to a variety of real-world problems like identifying marker genes and building a classifier in the context of cancer diagnosis using microarray data.The gene selection method suggested in this paper is demonstrated on four real-world datasets and found to be consistent with the literature or biological knowledge of these sets.
Support vector machines (SVMs) with the Gaussian (RBF) kernel have been popular for practical use. Model selection in this class of SVMs involves two hyperparameters: the penalty parameter, C and the kernel width, sigma. This paper analyzes the behavior of the SVM classifier when these hyperparameters take very small or very large values. Our results help in a good understanding of the hyperparameter space that leads to an efficient heuristic method of searching for hyperparameter values with small generalization errors.One of the key theoretical results of this paper is that, as sigma goes to infinity the SVM classifier with gaussian kernel tends to the linear SVM classifier. This is an important result since it means that, a gaussian SVM designed to minimize the generalization error over all possible C and sigma has to be equal or better in performance than a linear SVM classifier with C tuned.
In this paper, we propose a multi-category classification method that combines binary classifiers through soft-max function. Posteriori probabilities are also obtained. Both, one-versus-all and one-versus-one classifiers can be used in the combination. Empirical comparison shows that the proposed method is competitive with other implementations of one-versus-all and one-versus-one methods in terms of both classification accuracy and posteriori probability estimate.
This paper describes Bayesian techniques for support vector classification. In particular, we propose a novel differentiable loss function called the trigonometric loss function which has the desirable characteristic of natural normalization in the likelihood function, and then follow standard Gaussian processes techniques to set up a Bayesian framework. In this framework, Bayesian inference is used to implement model adaptation, while keeping the merits of support vector classifier, such as sparseness and convex programming. This differs from standard Gaussian processes for classification. Moreover, we put forward class probability in making predictions. Experimental results on benchmark data sets indicate the usefulness of this approach.For a code, download this directory: bisvm.zip.
This paper extends the well-known SMO algorithm of Support Vector Machines (SVMs) to Least Squares SVM formulations which include LS-SVM classification, Kernel Ridge Regression and a particular form of regularized Kernel Fisher Discriminant. The algorithm is shown to be asymptotically convergent. It is also extremely easy to implement. Computational experiments show that the algorithm is fast and scales efficiently (quadratically) as a function of the number of examples.
Choosing optimal hyperparameter values for support vector machines is an important step in SVM design. This is usually done by minimizing either an estimate of generalization error or some other related performance measure. In this paper, we empirically study the usefulness of several simple performance measures that are inexpensive to compute (in the sense that they do not require expensive matrix operations involving the kernel matrix). The results point out which of these measures are adequate functionals for tuning SVM hyperparameters. For SVMs with L1 soft margin formulation, none of the simple measures yields a performance uniformly as good as k-fold cross validation; Joachims' Xi-Alpha bound and Wahba et al's GACV come next and perform reasonably well. For SVMs with L2 soft margin formulation, the radius margin bound gives a very good prediction of optimal hyperparameter values.Follow this link for details: comparison.html.
To be inserted.
In collaboration with DTI Institute and Genome Institute of Singapore our Datamining group took part in Task 1 (Information extraction from Biomedical articles) of KDD Cup 2002 and got a honourable mention. Click here for a short article written for the university monthly magazine. For a more detailed article that was published in SIGKDD Explorations click here.
In this short report I discuss implementation issues related to the tuning of the hyperparameters of a Support Vector Machine (SVM) with L2 soft margin, for which the radius/margin bound is taken as the index to be minimized and iterative techniques are employed for computing radius and margin. The implementation is shown to be feasible and efficient even for large problems having more than 10,000 support vectors.A report discussing the implementation issues, together with an old code running on Matlab 5.3 interface, can be downloaded from the following page: nparm.html.
Convergence of a generalized version of the modified SMO algorithms given by Keerthi et al for SVM classifier design is proved. The convergence results are also extended to modified SMO algorithms for solving nu-SVM classifier problems.There is a slight flaw in the final argument in the convergence proof of the above paper. This has been corrected by Takahashi and Nishi (IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp. 774-776, May 2005).
This paper gives a new iterative algorithm for kernel logistic regression. It is based on the solution of the dual problem using ideas similar to those of the SMO algorithm for Support Vector Machines. Asymptotic convergence of the algorithm is proved. Computational experiments show that the algorithm is robust and fast. The algorithmic ideas can also be used to give a fast dual algorithm for solving the optimization problem arising in the inner loop of Gaussian Process classifiers.
In this paper we use a unified loss function, called the soft insensitive loss function, for Bayesian support vector regression. We follow standard Gaussian processes for regression to set up the Bayesian framework, in which the unified loss function is used in the likelihood evaluation. Under this framework, the maximum a posteriori estimate of the function values corresponds to the solution of an extended support vector regression problem. The overall approach has the merits of support vector regression such as convex quadratic programming and sparsity in solution representation. It also has the advantages of Bayesian methods for model adaptation and error bars for its predictions. Experimental results on simulated and real-world data sets indicate that the approach works well even on large data sets.
Choosing optimal hyperparameter values for support vector machines is an important step in SVM design. This is usually done by minimizing either an estimate of generalization error or some other related performance measure. In this paper, we empirically study the usefulness of several simple performance measures that are inexpensive to compute (in the sense that they do not require expensive matrix operations involving the kernel matrix). The results point out which of these measures are adequate functionals for tuning SVM hyperparameters. For SVMs with L1 soft margin formulation, none of the simple measures yields a performance uniformly as good as k-fold cross validation; Joachims' Xi-Alpha bound and Wahba et al's GACV come next and perform reasonably well. For SVMs with L2 soft margin formulation, the radius margin bound gives a very good prediction of optimal hyperparameter values.Follow this link for details: comparison.html.
The chief aim of this paper is to propose mean-field approximations for a broad class of Belief networks, of which sigmoid and noisy-or networks can be seen as special cases. The approximations are based on a powerful mean-field theory suggested by Plefka. We show that Saul, Jaakkola and Jordan' s approach is the first order approximation in Plefka's approach, via a variational derivation. The application of Plefka's theory to belief networks is not computationally tractable. To tackle this problem we propose new approximations based on Taylor series. Small scale experiments show that the proposed schemes are attractive.
To be inserted.
This paper points out an important source of confusion and inefficiency in Platt's Sequential Minimal Optimization (SMO) algorithm that is caused by the use of a single threshold value. Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO. These modified algorithms perform significantly faster than the original SMO on all benchmark datasets tried.The modified SMO algorithm described in the above paper has been popularly used. For example, the LIBSVM code uses a variation of this algorithm.
Click here for a postscript file containing the original technical report with pseudocode.
Gaussian Processes are powerful non-parametric models specified by parameterized mean and covariance functions. For regression they are simpler and better than Neural Networks. Standard approaches to choose hyperparameters in Gaussian Processes are Maximum Likelihood (ML) and Maximum APosterior (MAP) approaches. Recently we have proposed and investigated predictive approaches based on Geisser's Predictive Sample Reuse (PSR) methodology and the related Stone's Cross Validation (CV) methodology. More specifically, we have derived results for Geisser's surrogate Predictive Probability (GPP), Geisser's Predictive mean square Error (GPE) and the standard CV error and have made a comparative study. Within an approximation we have also arrived at the Generalized Cross Validation (GCV) and established its relationship with the GPP and GPE approaches. These approaches have been tested on a number of benchmark problems. Experimental results show that these approaches are strongly competitive to the existing approaches.
To be inserted.
Gaussian Processes are powerful non-parametric models specified by parameterized mean and covariance functions. For regression they are simpler and better than Neural Networks. Standard approaches to choose hyperparameters in Gaussian Processes are Maximum Likelihood (ML) and Maximum APosterior (MAP) approaches. Recently we have proposed and investigated predictive approaches based on Geisser's Predictive Sample Reuse (PSR) methodology and the related Stone's Cross Validation (CV) methodology. More specifically, we have derived results for Geisser's surrogate Predictive Probability (GPP), Geisser's Predictive mean square Error (GPE) and the standard CV error and have made a comparative study. Within an approximation we have also arrived at the Generalized Cross Validation (GCV) and established its relationship with the GPP and GPE approaches. These approaches have been tested on a number of benchmark problems. Experimental results show that these approaches are strongly competitive to the existing approaches.
To be inserted.
In this report we give a new, fast iterative algorithm for support vector machine (SVM) classifier design. The basic problem treated is one that does not allow classification violations. The problem is converted to a problem of computing the nearest point between two convex polytopes. The suitability of two classical nearest point algorithms, due to Gilbert, and Mitchell, Dem'yanov and Malozemov, is studied. Ideas from both these algorithms are combined and modified to derive our fast algorithm. For problems which require classification violations to be allowed, the violations are quadratically penalized and an idea due to Friess is used to convert it to a problem in which there are no classification violations. Comparative computational evaluation of our algorithm against powerful SVM methods such as Platt's Sequential Minimal Optimization shows that our algorithm is very competitive.Click here for a postscript file containing the original technical report with pseudocode.
Click here for a fortran code.
In this paper, a stochastic connectionist approach is proposed for solving function optimization problems with real-valued parameters. With the assumption of increased processing capability of a node in the connectionist network, we show how a broader class of problems can be solved. As the proposed approach is a stochastic search technique, it avoids getting stuck in local optima. Robustness of the approach is demostrated on several multi-modal functions with different numbers of variables. Optimization of a well-known partitional clustering criterion, the squared-error criterion (SEC), is formulated as a function optimization problem and is solved using the proposed approach. This approach is used to cluster selected data sets and the results obtained are compared with that of the K-means algorithm and a simulated annealing (SA) approach. The amenability of the connectionist approach to parallelization enables effective use of parallel hardware.
This paper points out an important source of inefficiency in Smola and Scholkopf's Sequential Minimal Optimization (SMO) algorithm for SVM regression that is caused by the use of a single threshold value. Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO for regression. These modified algorithms perform significantly faster than the original SMO on the datasets tried.
An alternate derivation of Plefka's method is presented which is obtained from minimizing Gibbs energy. It is shown that this method can be rederived using a perturbation expansion of Kullback-Leibler divergence, the latter having a nice information geometric interpretation.
The somersaulting maneuver of a platform diver is studied. An effective numerical approach for obtaining an optimal solution for this motion is given. The diver is modeled as a planar system of interconnected multibodies, and controllability is proved in a sense dictated by the problem. A time-optimal control problem with state and control constraints is set up and solved using a numerical approach. The numerical solution agrees well with motions executed by professional divers.
An algorithm for computing the Euclidean distance between a pair of convex sets in Rm is described. Extensive numerical experience with a broad family of polytopes in R3 shows that the computational cost is approximately linear in the total number of vertices specifying the two polytopes. The algorithm has special features which makes its application in a variety of robotics problems attractive. These features are discussed and an example of collision detection is given.
Stability results are given for a class of feedback systems arising from the regulation of time-varying discrete-time systems using optimal infinite-horizon and moving-horizon feedback laws. The class is characterized by joint constraints on the state and the control, a general nonlinear cost function and nonlinear equations of motion possessing two special properties. It is shown that weak conditions on the cost function and the constraints are sufficient to guarantee uniform asymptotic stability of both the infinite-horizon and moving-horizon feedback systems. The infinite-horizon cost associated with the moving-horizon feedback law approaches the optimal infinite-horizon cost as the moving-horizon is extended.
Last updated: May 15, 2012