I am a researcher in Criteo Research, a great team of researchers (spread out in Paris, Grenoble and Palo Alto) working on fundamental and applied research problems in computational advertising, with Suju Rajan as the VP Head of Research. Previously I was in Microsoft (June 2012-December 2017) (located in Mountain View, CA), first with the CISL team in Big Data and later with the FAST division of Microsoft Office. From January 2004-April 2012 I was with the Machine Learning Group of Yahoo! Research, in Santa Clara, CA. My recent research has mainly focused on the design of distributed training algorithms for developing various types of linear and nonlinear models on Big Data, and the application of machine learning to textual problems.

Prior to joining Yahoo! Research, I worked for 11 years at the Indian Institute of Science, Bangalore, and for 5 years at the National University of Singapore. During those sixteen years my research focused on the development of practical algorithms for a variety of areas, such as machine learning, robotics, computer graphics and optimal control. (Many of the publications during that period are not mentioned in this page.) My works on support vector machines (e.g., improved SMO algorithm), polytope distance computation (e.g., GJK algorithm) and model predictive control (e.g., stability theory) are highly cited. Overall, I have published more than 100 papers in leading journals and conferences. I am an Action Editor of JMLR (Journal of Machine Learning Research) since 2008. Previously I was an Associate Editor for the IEEE Transactions on Automation Science and Engineering.

**Contact:** s.selvaraj at criteo dot com

**Slide deck of my talk on Interplay between Optimization and Generalization in Deep Neural Networks given at the 3rd annual Machine Learning in the Real World Workshop organized by Criteo Research, Paris, on 8th November, 2017: Optimization_and_Generalization_Keerthi_Criteo_November_08_2017.pptx**. This is a review and critique of recent works in this topic. The actual talk was for 45 minutes and I covered the main ideas quickly. The ppt has more detailed material. I intend to update the slide deck as new works are published on this and related topics.

**Slide deck of my talks on Optimization for machine learning given at UC Santa Cruz in February, 2017: Keerthi_Optimization_For_ML_UCSC_2017.pdf**

**In 2010 I attended and gave a talk at GilbertFest, a symposium in honor of my Ph.D thesis advisor, Elmer G. Gilbert. Check out the symposium page, which also has pdfs of his classic papers in Control and Optimization. I am honored to have some of my joint papers with him in that list. Also, check out his A Life in Control talk given at the University of Michigan, Ann Arbor covering his marvelous career in control systems. **

- LIBLINEAR-a
*Library for Large Linear Classification*written by Chih-Jen Lin and his students. It has codes for the methods covered in the ICML'08, KDD'08, ICML'07/JMLR'08 papers given below.**Check out LIBLINEAR's most recent Distributed version. It gives cool speedups in multicore settings. For the cluster case where communication is a bottleneck, the algorithms in our JMLR-2017 papers below are very good.**

Citations of my papers in Google Scholar

2018 2017 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999 and Earlier

**Batch-Expansion Training: An Efficient Optimization Framework**. With Michal Derezinski, Dhruv Mahajan, Vishy Vishwanathan, and Markus Weimer, To be presented in AISTATS, 2018. [abstract]We propose Batch-Expansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset. As opposed to stochastic approaches, batches do not need to be resampled i.i.d. at every iteration, thus making BET more resource efficient in a distributed setting, and when disk-access is constrained. Moreover, BET can be easily paired with most batch optimizers, does not require any parameter-tuning, and compares favorably to existing stochastic and batch methods. We show that when the batch size grows exponentially with the number of outer iterations, BET achieves optimal O(1/epsilon) data-access convergence rate for strongly convex objectives. Experiments in parallel and distributed settings show that BET performs better than standard batch and stochastic approaches.

**An efficient distributed learning algorithm based on effective local functional approximations**. With Dhruv Mahajan, Nikunj Agrawal, S. Sundararajan and Leon Bottou, To appear in JMLR, 2017. [abstract]Scalable machine learning over big data is an important problem that is receiving a lot of attention in recent years. On popular distributed environments such as Hadoop running on a cluster of commodity machines, communication costs are substantial and algorithms need to be designed suitably considering those costs. In this paper we give a novel approach to the distributed training of linear classifiers (involving smooth losses and L_2 regularization) that is designed to reduce the total communication costs. At each iteration, the nodes minimize locally formed approximate objective functions; then the resulting minimizers are combined to form a descent direction to move. Our approach gives a lot of freedom in the formation of the approximate objective function as well as in the choice of methods to solve them. The method is shown to have O(log(1/epsilon)) time convergence. The method can be viewed as an iterative parameter mixing method. A special instantiation yields a parallel stochastic gradient descent method with strong convergence. When communication times between nodes are large, our method is much faster than the Terascale method~\citep{agarwal2011}, which is a state of the art distributed solver based on the statistical query model~\citep{chu2006} that computes function and gradient values in a distributed fashion. We also evaluate against other recent distributed methods and demonstrate superior performance of our method.

**A distributed block coordinate descent method for training l_1 regularized linear classifiers**. With Dhruv Mahajan and S. Sundararajan, To appear in JMLR, 2017. [abstract]Distributed training of l_1 regularized classifiers has received great attention recently. Most existing methods approach this problem by taking steps obtained from approximating the objective by a quadratic approximation that is decoupled at the individual variable level. These methods are designed for multicore systems where communication costs are low. They are inefficient on systems such as Hadoop running on a cluster of commodity machines where communication costs are substantial. In this paper we design a distributed algorithm for $l_1$ regularization that is much better suited for such systems than existing algorithms. A careful cost analysis is used to support these points and motivate our method. The main idea of our algorithm is to do block optimization of many variables on the actual objective function within each computing node; this increases the computational cost per step that is matched with the communication cost, and decreases the number of outer iterations, thus yielding a faster overall method. Distributed Gauss-Seidel and Gauss-Southwell greedy schemes are used for choosing variables to update in each step. We establish global convergence theory for our algorithm, including Q-linear rate of convergence. Experiments on two benchmark problems show our method to be much faster than existing methods.

**Gradient Boosted Decision Trees for High Dimensional Sparse Output**. With Si Si, Cho-Jui Hsieh, Huan Zhang, Dhruv Mahajan and Inderjit Dhillon, Accepted in ICML, 2017. [abstract]In this paper, we study the gradient boosted decision trees (GBDT) when the output space is high dimensional and sparse. For example, in multilabel classification, the output space is a L-dimensional 0/1 vector, where L is number of labels that can grow to millions and beyond in many modern applications. We show that GBDT can easily run out of memory or encounter near-forever running time in this regime, and propose a new GBDT variant, called GBDT-SPARSE to resolve this problem by employing L_0 regularization. We then discuss in detail how to utilize this sparsity to conduct GBDT training, including splitting the nodes, computing the sparse residual, and predicting in sublinear time. Finally, we apply our algorithm to extreme multilabel classification problems, and show that the proposed GBDT-SPARSE achieves an order of magnitude improvements in model size and prediction time over existing methods, while yielding similar performance.

**Towards a Better Understanding of Predict and Count Models**. With Tobias Schnabel and Rajiv Khanna, arXiv:1511.02024v1, 2015. [abstract]In a recent paper, Levy and Goldberg pointed out an interesting connection between prediction-based word embedding models and count models based on pointwise mutual information. Under certain conditions, they showed that both models end up optimizing equivalent objective functions. This paper explores this connection in more detail and lays out the factors leading to di?erences between these models. We ?nd that the most relevant di?erences from an optimization perspective are (i) predict models work in a low dimensional space where embedding vectors can interact heavily; (ii) since predict models have fewer parameters, they are less prone to over?tting. Motivated by the insight of our analysis, we show how count models can be regularized in a principled manner and provide closed-form solutions for L1 and L2 regularization. Finally, we propose a new embedding model with a convex objective and the additional bene?t of being intelligible.

**Learning a Hierarchical Monitoring System for Detecting and Diagnosing Service Issues**. With Vinod Nair, Ameya Raul, Shwetabh Khanduja, Vikas Bahirwani, S. Sundararajan, Steve Herbert, Sudheer Dhulipalla, and Qihong Shao, KDD, 2015. [abstract]We propose a machine learning based framework for building a hierarchical monitoring system to detect and diagnose service issues. We demonstrate its use for building a monitoring system for a distributed data storage and computing service consisting of tens of thousands of machines. Our solution has been deployed in production as an end-to-end system, starting from telemetry data collection from individual machines, to a visualization tool for service operators to examine the detection outputs. Evaluation results are presented on detecting 19 customer impacting issues in the past three months.

**Near Real-time Service Monitoring Using High-dimensional Time Series**. With Vinod Nair, Sundararajan Sellamanickam, Shwetabh Khanduja, Ameya Raul, and Ajesh Shaj, ICDM, 2015. [abstract]We demonstrate a near real-time service monitoring system for detecting and diagnosing issues from high-dimensional time series data. For detection, we have implemented a learning algorithm that constructs a hierarchy of detectors from data. It is scalable, does not require labelled examples of issues for learning, runs in near real-time, and identifies a subset of counter time series as being relevant for a detected issue. For diagnosis, we provide efficient algorithms as post-detection diagnosis aids to find further relevant counter time series at issue times, a SQLlike query language for writing flexible queries that apply these algorithms on the time series data, and a graphical user interface for visualizing the detection and diagnosis results. Our solution has been deployed in production as an end-to-end system for monitoring an industrial distributed data storage and computing service consisting of tens of thousands of machines and currently analyses 12000 counter time series.

**A distributed block coordinate descent method for training L1 regularized linear classifiers**. With Dhruv Mahajan and S. Sundararajan, submitted to JMLR, 2014. [abstract]Distributed training of L1 regularized classifiers has received great attention recently. Most existing methods approach this problem by taking steps obtained from approximating the objective by a quadratic approximation that is decoupled at the individual variable level. These methods are designed for multicore systems where communication costs are low. They are inefficient on systems such as Hadoop running on a cluster of commodity machines where communication costs are substantial. In this paper we design a distributed algorithm for L1 regularization that is much better suited for such systems than existing algorithms. A careful cost analysis is used to support these points and motivate our method. The main idea of our algorithm is to do block optimization of many variables on the actual objective function within each computing node; this increases the computational cost per step that is matched with the communication cost, and decreases the number of outer iterations, thus yielding a faster overall method. Distributed Gauss-Seidel and Gauss-Southwell greedy schemes are used for choosing variables to update in each step. We establish global convergence theory for our algorithm, including Q-linear rate of convergence. Experiments on two benchmark problems show our method to be much faster than existing methods.

**An efficient distributed learning algorithm based on effective local functional approximations**. With Dhruv Mahajan, Nikunj Agrawal, S. Sundararajan and Leon Bottou, submitted to JMLR, 2014. [abstract]Scalable machine learning over big data is an important problem that is receiving a lot of attention in recent years. On popular distributed environments such as Hadoop running on a cluster of commodity machines, communication costs are substantial and algorithms need to be designed suitably considering those costs. In this paper we give a novel approach to the distributed training of linear classifiers (involving smooth losses and L2 regularization) that is designed to reduce the total communication costs. At each iteration, the nodes minimize locally formed approximate objective functions; then the resulting minimizers are combined to form a descent direction to move. Our approach gives a lot of freedom in the formation of the approximate objective function as well as in the choice of methods to solve them. The method is shown to have O(log(1/epsilon)) time convergence. The method can be viewed as an iterative parameter mixing method. A special instantiation yields a parallel stochastic gradient descent method with strong convergence. When communication times between nodes are large, our method is much faster than the Terascale method (Agarwal et al, 2011), which is a state of the art distributed solver based on the statistical query model (Chu et al, 2006) that computes function and gradient values in a distributed fashion. We also evaluate against other recent distributed methods and demonstrate superior performance of our method.

**Tractable semi-supervised learning of complex structured prediction models**. With Kai-Wei Chang and S. Sundararajan, ECML, 2013. [abstract]Semi-supervised learning has been widely studied in the literature. However, most previous works assume that the output structure is simple enough to allow the direct use of tractable inference/learning algorithms (e.g., binary label or linear chain).Therefore, these methods cannot be applied to problems with complex structure. In this paper, we propose an approximate semi-supervised learning method that uses piecewise training for estimating the model weights anda dual decomposition approach for solving the inference problem of finding the labels of unlabeled data subject to domain specific constraints.This allows us to extend semi-supervised learning to general structured prediction problems.As an example, we apply this approach to the problem of multi-label classification (a fully connected pairwise Markov random field). Experimental results on benchmark data show that, in spite of using approximations, the approach is effective and yields good improvements in generalization performance over the plain supervised method. In addition, we demonstrate that our inference engine can be applied to other semi-supervised learning frameworks, and extends them to solve problems with complex structure.

**Extension of TSVM to multi-class and hierarchical textclassification problems with general losses (Long version)**. With S. Sundararajan and S.K. Shevade, COLING, 2012. [abstract]Transductive SVM (TSVM) is a well known semi-supervised large margin learning method for binary text classification. In this paper we extend this method to multi-class and hierarchicalclassification problems. We point out that the determination of labels of unlabeled examples with fixed classifier weights is a linear programming problem. We devise an efficient technique for solving it. The method is applicable to general loss functions. We demonstrate the value of the new method using large margin loss on a number of multiclass and hierarchical classification datasets. For maxent loss we show empirically that our method is better than expectation regularization/constraint and posterior regularization methods, and competitive with the version of entropy regularization method which uses label constraints.

**Iterative Viterbi A* algorithm for K-best sequential coding**. With Z. Huang, Y. Chang, B. Long, J.F. Crespo, A. Dong and S.L. Wu, ACL, 2012. [abstract]Sequential modeling has been widely used in a variety of important applications including named entity recognition and shallow parsing. However, as more and more real time large-scale tagging applications arise, decoding speed has become a bottleneck for existing sequential tagging algorithms. In this paper we propose 1-best A*, 1-best iterative A*, k-best A* and k-best iterative Viterbi A* algorithms for sequential decoding. We show the efficiency of these proposed algorithms for five NLP tagging tasks. In particular, we show that iterative Viterbi A* decoding can be several times or orders of magnitude faster thanthe state-of-the-art algorithm for tagging tasks with a large number of labels. This algorithm makes real-time large-scale tagging applications with thousands of labels feasible.

**A simple approach to the design of site-level extractors usingdomain-centric principles**. With C. Long, X. Geng and C. Xu, CIKM, 2012. [abstract]We consider the problem of extracting, in a domain-centric fashion, a given set of attributes from a large number of semi-structured websites. Previous approaches [7, 5] to solve this problem are based on page level inference. We propose a distinct new approach that directly chooses attribute extractors for a site using a scoring mechanism that is designed at the domain level via simple classification methods using a training set from a small number of sites. To keep the number of candidate extractors in each site manageably small we use two observations that hold in most domains: (a) imprecise annotators can be used to identify a small set of candidate extractors for a few attributes (anchors); and (b) non-anchor attributes lie in close proximity to the anchor attributes. Experiments on three domains (Events, Books and Restaurants) show that our approach is very effective in spite of its simplicity.

**Deterministic annealing for semi-supervisedstructured output learning**. With P. Dhillon, K. Bellare, O. Chapelle and S. Sundararajan, AISTATS, 2012. [abstract]In this paper we propose a new approach forsemi-supervised structured output learning.Our approach uses relaxed labeling on unlabeleddata to deal with the combinatorialnature of the label space and further uses domainconstraints to guide the learning. Sincethe overall objective is non-convex, we alternatebetween the optimization of the modelparameters and the label distribution of unlabeleddata. The alternating optimizationcoupled with deterministic annealing helps usachieve better local optima and as a resultour approach leads to better constraint satisfactionduring inference. Experimental resultson sequence labeling benchmarks showsuperior performance of our approach comparedto Constraint Driven Learning (CoDL)and Posterior Regularization (PR).

**Regularized structured output learning with partial labels**. With S. Sundararajan and C. Tiwari. SDM, 2012. [abstract]We consider the problem of learning structured output probabilistic models with training examples having partial labels.Partial label scenarios arise commonly in web applicationssuch as taxonomy (hierarchical) classification, multi-labelclassification and information extraction from web pages.For example, label information may be available only at theinternal node level (not at the leaf level) for some pages in ataxonomy classification problem. In a multi-label classification problem, it may be available only for some of the classes(in each example). Similarly, in a sequence learning problem, we may have label information only for some nodes inthe training sequences. Conventionally, marginal likelihoodmaximization technique has been used to solve these problems. In such a solution unlabeled examples and any sideinformation like expected label distribution (or correlation ina multi-label setting) of the unlabeled part are not used. Wesolve these problems by incorporating entropy and label distribution or correlation regularizations along with marginallikelihood. Entropy and label distribution regularizationshave been used previously in semi-supervised learning withfully unlabeled examples. In this paper we develop probabilistic taxonomy and multi-label classifier models, and provide the ideas needed for expanding their usage to the partiallabel scenario. Experiments on real-life taxonomy and multilabel learning problems show that significant improvementsin accuracy are achieved by incorporating these regularizations, when most of the examples are only partially labeled.

**A sequential dual method for structural SVMs**. With P. Balamurugan, S.K. Shevade and S. Sundararajan, SDM, 2011. [abstract]In many real world prediction problems the output is a structuredobject like a sequence or a tree or a graph. Suchproblems range from natural language processing to computationalbiology or computer vision and have been tackledusing algorithms, referred to as structured output learningalgorithms. We consider the problem of structured classification.In the last few years, large margin classifiers like supportvector machines (SVMs) have shown much promise forstructured output learning. The related optimization problemis a convex quadratic program (QP) with a large numberof constraints, which makes the problem intractable forlarge data sets. This paper proposes a fast sequential dualmethod (SDM) for structural SVMs. The method makes repeatedpasses over the training set and optimizes the dualvariables associated with one example at a time. The useof additional heuristics makes the proposed method moreefficient. We present an extensive empirical evaluation ofthe proposed method on several sequence learning problems.Our experiments on large data sets demonstrate that theproposed method is an order of magnitude faster than stateof the art methods like cutting-plane method and stochasticgradient descent method (SGD). Further, SDM reachessteady state generalization performance faster than the SGDmethod. The proposed SDM is thus a useful alternative forlarge scale structured output learning.

**Semi-supervised multi-task learning of structuredprediction models for web information extraction**. With P. Dhillon and S. Sundararajan. CIKM, 2011. [abstract]Extracting information from web pages is an important problem;it has several applications such as providing improvedsearch results and construction of databases to serve userqueries. In this paper we propose a novel structured predictionmethod to address two important aspects of the extractionproblem: (1) labeled data is available only for a smallnumber of sites and (2) a machine learned global model doesnot generalize adequately well across many websites. Forthis purpose, we propose a weight space based graph regularizationmethod. This method has several advantages.First, it can use unlabeled data to address the limited labeleddata problem and falls in the class of graph regularizationbased semi-supervised learning approaches. Second,to address the generalization inadequacy of a global model,this method builds a local model for each website. Viewingthe problem of building a local model for each website asa task, we learn the models for a collection of sites jointly;thus our method can also be seen as a graph regularizationbased multi-task learning approach. Learning the modelsjointly with the proposed method is very useful in two ways:(1) learning a local model for a website can be effectively influenced by labeled and unlabeled data from other websites;and (2) even for a website with only unlabeled examples it ispossible to learn a decent local model. We demonstrate theefficacy of our method on several real-life data; experimentalresults show that significant performance improvement canbe obtained by combining semi-supervised and multi-tasklearning in a single framework.

**A pairwise ranking based approach to learning withpositive and unlabeled examples**. With S. Sundararajan and P. Garg. CIKM, 2011. [abstract]A large fraction of binary classification problems arising in webapplications are of the type where the positive class is well definedand compact while the negative class comprises everything else inthe distribution for which the classifier is developed; it is hard torepresent and sample from such a broad negative class. Classifiersbased only on positive and unlabeled examples reduce human annotationeffort significantly by removing the burden of choosing arepresentative set of negative examples. Various methods have beenproposed in the literature for building such classifiers. Of these,the state of the art methods are Biased SVM and Elkan & Noto'smethods. While these methods often work well in practice, theyare computationally expensive since hyperparameter tuning is veryimportant, particularly when the size of labeled positive examplesset is small and class imbalance is high. In this paper we proposea pairwise ranking based approach to learn from positive and unlabeledexamples (LPU) and we give a theoretical justification for it.We present a pairwise RankSVM (RSVM) based method for ourapproach. The method is simple, efficient, and its hyperparametersare easy to tune. A detailed experimental study using severalbenchmark datasets shows that the proposed method gives competitiveclassification performance compared to the mentioned state ofthe art methods, while training 3-10 times faster. We also proposean efficient AUC based feature selection technique in the LPU settingand demonstrate its usefulness on the datasets. To get an ideaof the goodness of the LPU methods we compare them against supervisedlearning (SL) methods that also make use of negative examplesin training. SL methods give a slightly better performancethan LPU methods when there is a rich set of negative examples;however, they are inferior when the number of negative trainingexamples is not large enough.

**Semi-supervised SVMs for classification with unknownclass proportions and a small labeled dataset**. With B. Bhar, S. Sundararajan and S.K. Shevade. CIKM, 2011. [abstract]In the design of practical web page classification systems oneoften encounters a situation in which the labeled trainingset is created by choosing some examples from each class;but, the class proportions in this set are not the same asthose in the test distribution to which the classifier willbe actually applied. The problem is made worse when theamount of training data is also small. In this paper we ex-plore and adapt binary SVM methods that make use of un-labeled data from the test distribution, viz., TransductiveSVMs (TSVMs) and expectation regularization/constraint(ER/EC) methods to deal with this situation. We empir-ically show that when the labeled training data is small,TSVM designed using the class ratio tuned by minimizingthe loss on the labeled set yields the best performance; itsperformance is good even when the deviation between theclass ratios of the labeled training set and the test set isquite large. When the labeled training data is sufficientlylarge, an unsupervised Gaussian mixture model can be usedto get a very good estimate of the class ratio in the test set;also, when this estimate is used, both TSVM and EC/ERgive their best possible performance, with TSVM comingout superior. The ideas in the paper can be easily extendedto multi-class SVMs and MaxEnt models.

**Transductive classification methods for mixed graphs**. With S. Sundararajan. MLG-2011 (KDD Workshop), 2011. [abstract]In this paper we provide a principled approach to solve atransductive classification problem involving a similar graph(edges tend to connect nodes with same labels) and a dissimilargraph (edges tend to connect nodes with opposing labels).Most of the existing methods, e.g., Information Regularization(IR), Weighted vote Relational Neighbor classifier(WvRN) etc, assume that the given graph is only a similargraph. We extend the IR and WvRN methods to deal withmixed graphs. We evaluate the proposed extensions on severalbenchmark datasets as well as two real world datasetsand demonstrate the usefulness of our ideas.

**Large scale information extraction from the web**. Industrial Problems Seminar, IMA, Univ. of Minnesotta, Feb 11, 2011. [abstract]Talk that descibes approaches to large scale information extraction from the web.

**Efficient algorithms for ranking with SVMs**. With O. Chapelle. Information Retrieval Journal, 13(3):201.215, 2010. [abstract]RankSVM (Herbrich et al, 2000; Joachims, 2002) is a pairwise methodfor designing ranking models. SVMLight is the only publicly available software forRankSVM. It is slow and, due to incomplete training with it, previous evaluationsshow RankSVM to have inferior ranking performance. We propose new methods basedon primal Newton method to speed up RankSVM training and show that they are 5 ordersof magnitude faster than SVMLight. Evaluation on the Letor benchmark datasetsafter complete training using such methods shows that the performance of RankSVMis excellent.

**Graph based classification methods using inaccurate externalclassifier information**. With S. Sundararajan, Tech Report, 2009. [abstract]In this paper we consider the problem of collectively classifyingentities where relational information is available acrossthe entities. In practice inaccurate class distribution for eachentity is often available from another (external) classifier.For example this distribution could come from a classifierbuilt using content features or a simple dictionary. Giventhe relational and inaccurate external classifier information,we consider two graph based settings in which the problemof collective classification can be solved. In the first settingthe class distribution is used to fix labels to a subset of nodesand the labels for the remaining nodes are obtained like in atransductive setting. In the other setting the class distributionsof all nodes are used to define the fitting function partof a graph regularized objective function. We define a generalized objective function that handles both the settings.Methods like harmonic Gaussian field and local-global consistency (LGC) reported in the literature can be seen as special cases. We extend the LGC and weighted vote relationalneighbor classification (WvRN) methods to support usage ofexternal classifier information. We also propose an educientleast squares regularization (LSR) based method and relateit to information regularization methods. All the methodsare evaluated on several benchmark and real world datasets.Considering together speed, robustness and accuracy, experimental results indicate that the LSR and WvRN-extensionmethods perform better than other methods.

**A dual coordinate descent method for large-scale linear SVM**. With C.-J. Hsieh, K.-W. Chang, C.-J. Lin, and S. Sundararajan. ICML 2008. [abstract]In many applications, data appear with a huge number of instances as well as features.Linear Support Vector Machines (SVM) is one of the most populartools to deal with such large-scale sparse data.This paper presents a novel dual coordinate descent method for linear SVM withL1- and L2-loss functions.The proposed method is simple and reaches an $\epsilon$-accurate solution in$O(\log (1/\epsilon))$ iterations.Experiments indicate that our method is muchfaster than state of the artsolvers such as \Pegasos, \Tron, \svmperf, and a recent primal coordinatedescent implementation.

**A sequential dual coordinate method for large-scale multi-class linear SVMs**. With S. Sundararajan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. KDD 2008. [abstract]Efficient training of direct multi-class formulations of linearSupport Vector Machines is very useful in applications suchas text classification with a huge number examples as wellas features. This paper presents a fast dual method for thistraining. The main idea is to sequentially traverse throughthe training set and optimize the dual variables associatedwith one example at a time. The speed of training is enhancedfurther by shrinking and cooling heuristics. Experimentsindicate that our method is much faster than state ofthe art solvers such as bundle, cutting plane and exponentiatedgradient methods.

**Optimization techniques for semi-supervised SVMs**. With Olivier Chapelle and Vikas Sindhwani. JMLR, 2008. [abstract]Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S^3VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S^3VMs. This paper reviews key ideas in this literature. The performance and behavior of various S^3VM algorithms is studied together, under a common experimental setting.

**Multi-Class Feature Selection with Support Vector Machines**. With Olivier Chapelle. JSM, 2008. [abstract]We consider feature selection in a multi-class settingwhere the goal is to find a small set of features for all theclasses simultaneously. We develop an embedded methodfor this problem using the idea of scaling factors. Traininginvolves the solution of a convex program for which wegive a scalable algorithm. The method is closely relatedto extensions of L1 regularization and recursive featureelimination. These methods are shown to be very ectivein text classification.

**Ad delivery with budgeted advertisers: A comprehensive LP approach**. With Zoe Abrams, Ofer Mendelevitch and John Tomlin. JECR, Vol. 9, No.1, 2008. [abstract]We study a comprehensive framework for sponsored search which incorporates advertiser budgets, query frequency forecasts, and pricing and ranking schemes. We propose a linear program for optimizing revenue (or the total value to advertisers) that has an exponential number of variables; however, we describe how it can be solved efficiently using column generation. The formulation is easily extendable to various levels of problem complexity, adaptable to dynamic environments, fast, and works well in terms of practical considerations. Simulations show significant improvements in revenue and efficiency.

**Trust-region Newton method for large scale logistic regression**. With Chih-Jen Lin and Ruby Weng. JMLR 2008, Vol. 9, pp. 627-650. [abstract]Large-scale logistic regression arises in many applications such as document classification andnatural language processing. In this paper, we apply a trust region Newton method to maximizethe log-likelihood of the logistic regression model. The proposed method uses only approximateNewton steps in the beginning, but achieves fast convergence in the end. Experiments show that itis faster than the commonly used quasi Newton approach for logistic regression. We also extendthe proposed method to large-scale L2-loss linear support vector machines (SVM).

**Participation (two winning entries in the wild track) in PASCAL Large Scale Learning Challenge**. With Olivier Chapelle, 2008. [abstract]The PASCAL challengewas conducted in Feb-June 2008 to evaluate the scalability and efficiency of existing ML approaches with respect to computational, memory or communication resources. This set of slides describes our participation using someSVM techniques that we developed.

**Predictive approaches for Gaussian process classifiermodel selection**. With S. Sundararajan, Tech Report, 2008. [abstract]In this paper we consider the problem of Gaussian process classifier (GPC) model selectionwith different Leave-One-Out (LOO) Cross Validation (CV) based optimization criteriaand provide a practical algorithm using LOO predictive distributions with such criteria toselect hyperparameters. Apart from the standard average negative logarithm of predictiveprobability (NLP), we also consider smoothed versions of criteria such as F-measure andWeighted Error Rate (WER), which are useful for handling imbalanced data. Unlike theregression case, LOO predictive distributions for the classifier case are intractable. Weuse approximate LOO predictive distributions arrived from Expectation Propagation (EP)approximation. We conduct experiments on several real world benchmark datasets. Whenthe NLP criterion is used for optimizing the hyperparameters, the predictive approachesshow better or comparable NLP generalization performance with existing GPC approaches.On the other hand, when the F-measure criterion is used, the F-measure generalizationperformance improves significantly on several datasets. Overall, the EP-based predictivealgorithm comes out as an excellent choice for GP classifier model selection with differentoptimization criteria.

**CRF versus SVM-struct for sequence labeling**. With S. Sundararajan. Yahoo Research Technical Report, 2007. [abstract]A recent comparison of various algorithms for sequence labeling by Nguyen and Guo indicated that SVM-struct has much superior generalization performance than CRFs. In this short report we point out that the above difference mainly arises because that comparison employed different softwares that use different internal feature functions. When the two methods are compared using identical feature functions they do turn out to have quite close peak performance.

**Predictive approaches for sparse Gaussian process regression**. With S. Sundararajan and Shirish Shevade. Tech Report, April 2007. [abstract]We propose and study algorithms using leave-one-out cross validation (LOO-CV) based predictive measures namely, LOO-CV error (LOO-CVE), Geisser's surrogate Predictive Probability (GPP) and Predictive Mean Squared Error (GPE) to select basis vectors for building sparse Gaussian process regression models. While the LOO-CVE measure uses only predictive mean information, the GPP and GPE measures use predictive variance information as well. All of these three LOO-CV based predictive measures can be used to find the number of basis vectors in the model automatically. The computational complexities of these measures are same as that of marginal likelihood and approximate log posterior probablity maximization approaches. We also give an efficient cache implementation for all the algorithms which gives similar or better generalization performance with lesser number of basis vectors. The experimental results on several real world benchmark datasets show better or comparable generalization performance over existing approaches.

**Constructing a maximum utility slate of on-line advertisements**. With John Tomlin. Submitted for publication in ORSA Journal of Computing. [abstract]We present an algorithm for constructing an optimal slate of sponsored search advertisements which respects the ordering that is the outcome of a generalized second price auction, but which must also accommodate complicating factors such as overall budget constraints. The algorithm is easily fast enough to use on the fly for typical problem sizes, or as a subroutine in an overall optimization.

**Trust-region Newton method for large scale logistic regression**. With Chih-Jen Lin and Ruby Weng. ICML 2007. [abstract]Large-scale logistic regression arises in many applicationssuch document classification and natural language processing.In this paper, we apply a trust region Newton methodto maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps inthe beginning, but achieves fast convergence in the end. Experimentsshow that it is faster than the commonly used quasi Newtonapproach for logistic regression. We also compareit with existing linear SVM implementations.

**Semi-supervised Gaussian processes**. With Wei Chu and Vikas Sindhwani. IJCAI 2007. [abstract]We consider the problem of utilizing unlabeled data for Gaussian process inference. Using a geometrically motivated data-dependent prior, we propose a graph-based construction of semi-supervised Gaussian processes. We demonstrate this approach empirically on several classification problems.

**An efficient method for gradient-based adaptation of hyperparameters in SVM models**. With Vikas Sindhwani and Olivier Chapelle. NIPS 2006. [Longer tech report.] [abstract]We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold cross-validation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations.

**Relational learning with Gaussian processes**. With Wei Chu, Vikas Sindhwani and Zoubin Ghahramani. NIPS 2006. [abstract]Correlation between instances is often modelled via a kernel function using input attributes of the instances. Relational knowledge can further reveal additional pairwise correlations between variables of interest. In this paper, we develop a class of models which incorporates both reciprocal relational information and input attributes using Gaussian process techniques. This approach provides a novel non-parametric Bayesian framework with a data-dependent prior for supervised learning tasks. We also apply this framework to semi-supervised learning. Experimental results on several real world data sets verify the usefulness of this algorithm.

**Branch and bound for semi-supervised support vector machines**. With Olivier Chapelle and Vikas Sindhwani. NIPS 2006. [abstract]Semi-supervised SVMs (S3VMs) attempt to learn low-density separators by maximizing the margin over labeled and unlabeled examples. The associated optimization problem is non-convex. To examine the full potential of S3VMs modulo local minima problems in current implementations, we apply branch and bound techniques for obtaining exact, globally optimal solutions. Empirical evidence suggests that the globally optimal solution can return excellent generalization performance in situations where other implementations fail completely. While our current implementation is only applicable to small datasets, we discuss variants that can potentially lead to practically useful algorithms.

**Building Support Vector Machines with Reduced Classifier Complexity**. With Olivier Chapelle and Dennis DeCoste. JMLR 2006. [abstract]Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax) to approximate the SVM primal cost function well; (3) it is efficient and roughly scales as O(n dmax^2) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors.

**Support vector ordinal regression**. With Wei Chu. Neural Computation 2006. [abstract]In this paper, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The SMO algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on some benchmark and real-world data sets, including applications of ordinal regression to information retrieval, verify the usefulness of these approaches.

See Wei Chu's homepage for a code.

**Fast generalized cross validation algorithm for sparse model learning**. With S. Sundararajan and Shirish Shevade. Neural Computation 2006. [abstract]In this paper we propose a fast incremental algorithm for designing linear regression models. The proposed algorithm generates a sparse model by optimizing multiple smoothing parameters using the generalized cross validation approach. The performances on synthetic and real-world datasets are compared with other incremental algorithms such as Tipping and Faul's fast Relevance Vector Machine, Chen et al's Orthogonal Least Squares and Orr's Regularized Forwards Selection. The results demonstrate that the proposed approach is competitive.

**Large-scale semi-supervised linear SVMs**. With Vikas Sindhwani. SIGIR 2006. [Longer tech report with pseudocodes.] [abstract]Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many Information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with possibly very large number of examples and features. At their core, our algorithms employ modified finite Newton techniques recently developed by Keerthi and DeCoste. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a mean field annealing (MFA) approach. This algorithm alleviates the problem of local minimum in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.

**Newton methods for fast solution of semi-supervised linear SVMs**. With Vikas Sindhwani. Book Chapter,*Large Scale Kernel Machines*, MIT Press, 2006. [abstract]This is a revised version of the above SIGIR 2006 paper, with updated experimental results based on the SVMlin solver written by Vikas Sindhwani.

**Deterministic annealing for semi-supervised kernel machines**. With Vikas Sindhwani and Olivier Chapelle. ICML 2006. [abstract]An intuitive approach to utilizing unlabeled data in kernel-based classification algorithms is to simply treat the unknown labels as additional optimization variables. For margin-based loss functions, one can view this approach as attempting to learn low-density separators. However, this is a hard optimization problem to solve in typical semi-supervised settings where unlabeled data is abundant. The popular Transductive SVM algorithm is a label-switching-retraining procedure that is known to be susceptible to local minima. In this paper, we present a global optimization framework for semi-supervised Kernel machines where an easier problem is parametrically deformed to the original hard problem and minimizers are smoothly tracked. Our approach is motivated from deterministic annealing techniques and involves a sequence of convex optimization problems that are exactly and efficiently solved. We present empirical results on several synthetic and real world datasets that demonstrate the effectiveness of our approach.

**Parallel sequential minimal optimization for the training of support vector machines**. With LJ Cao, CJ Ong, JQ Zhang, U Periyathamby, XJ Fu, HP Lee. IEEE transactions on Neural Networks, 17(4):1039-49. [abstract]Sequential minimal optimization (SMO) is one popular algorithm for training support vector machine (SVM), but it still requires a large amount of computation time for solving large size problems. This paper proposes one parallel implementation of SMO for training SVM. The parallel SMO is developed using message passing interface (MPI). Specifically, the parallel SMO first partitions the entire training data set into smaller subsets and then simultaneously runs multiple CPU processors to deal with each of the partitioned data sets. Experiments show that there is great speedup on the adult data set and the Mixing National Institute of Standard and Technology (MNIST) data set when many processors are used. There are also satisfactory results on the Web data set.

**A fast tracking algorithm for generalized LARS/LASSO**. With Shirish Shevade. To appear in IEEE transactions on Neural Networks. [abstract]This paper gives an efficient algorithm for tracking the solution curve of sparse logistic regression with respect to the L1 regularization parameter. The algorithm is based on approximating the logistic regression loss by a piecewise quadratic function, using Rosset and Zhu's path tracking algorithm on the approximate problem, and then applying a correction to get to the true path. Application of the algorithm to text classification and sparse kernel logistic regression shows that the algorithm is efficient.

**A modified finite Newton method for fast solution of large scale linear SVMs**. With Dennis DeCoste. JMLR 2005. [abstract]This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways. Experiments indicate that the method is much faster than decomposition methods such as SVMlight, SMO and BSVM (e.g., 4-100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber's loss function and the L1 loss function, and also for solving ordinal regression.

**A matching pursuit approach to sparse Gaussian process regression**. With Wei Chu. NIPS 2005. [abstract]In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions.

See Wei Chu's homepage for a code.

**A fast dual algorithm for kernel logistic regression**. With Kaibo Duan, Shirish Shevade and Aun Neow Poo. Machine Learning Journal 2005. [abstract]This paper gives a new iterative algorithm for kernel logistic regression. It is based on the solution of a dual problem using ideas similar to those of the Sequential Minimal Optimization algorithm for Support Vector Machines. Asymptotic convergence of the algorithm is proved. Computational experiments show that the algorithm is robust and fast. The algorithmic ideas can also be used to give a fast dual algorithm for solving the optimization problem arising in the inner loop of Gaussian Process classifiers.

**An improved conjugate gradient scheme to the solution of least squares SVM**. With Wei Chu and Chong Jin Ong. IEEE Transactions on Neural Networks 2005. [abstract]The Least Square Support Vector Machines (LS-SVM) formulation corresponds to the solution of a linear system of equations. Several approaches to its numerical solutions have been proposed in the literature. In this paper, we propose an improved method to the numerical solution of LS-SVM and show that the problem can be solved using one reduced system of linear equations. Compared with the existing algorithm of Suykens et al for LS-SVM, our approach is about twice as efficient. Numerical results using the proposed method are provided for comparisons with other existing algorithms.

**New approaches to support vector ordinal regression**. With Wei Chu. ICML 2005. [abstract]In this paper, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The SMO algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently as a quadratic function of the number of examples. The results of numerical experiments on benchmark datasets verify the usefulness of these approaches.

See Wei Chu's homepage for a code.

**Generalized LARS as an effective feature selection tool for text classification with SVMs**. ICML 2005. [abstract]In this paper we generalize the LARS feature selection method to the linear SVM model, derive an efficient algorithm for it, and empirically demonstrate its usefulness as a feature selection tool for text classification.

**Which is the best multiclass SVM method? An empirical study**. With Kaibo Duan. MCS 2005. [abstract]Multiclass SVMs are usually implemented by combining several two-class SVMs. The one-versus-all method using winner-takes-all strategy and the one-versus-one method implemented by max-wins voting are popularly used for this purpose. In this paper we give empirical evidence to show that these methods are overall inferior to another one-versus-one method: one that uses Platt's posterior probabilities together with the pairwise coupling idea of Hastie and Tibshirani, especially when the training dataset is sparse.

**Bayesian support vector regression using a unified loss function**. With Wei Chu and Chong Jin Ong. IEEE Transactions on Neural Networks 2004. [abstract]In this paper, we use soft insensitive loss function in likelihood evaluation, and describe a Bayesian framework in a stationary Gaussian process. Bayesian methods are used to implement model adaptation, while keeping the merits of support vector regression, such as quadratic programming and sparseness. Moreover, confidence interval is provided in prediction.

See Wei Chu's homepage for a code.

**Stability regions for constrained nonlinear systems and their functional characterization via support-vector-machine learning**. With Chong Jin Ong, Elmer Gilbert and Z.H. Zhang. Automatica 2004. [abstract]To be inserted.

**An efficient method for computing leave-one-out error in support vector machines with Gaussian kernels**. With M.M.S. Lee, Chong Jin Ong and Dennis DeCoste. IEEE Transactions on Neural Networks 2004. [abstract]In this paper we give an efficient method for computing the leave-one-out (LOO) error for support vector machines (SVMs) with Gaussian kernels quite accurately. It is particularly suitable for iterative decomposition methods of solving SVMs. The importance of various steps of the method is illustrated in detail by showing the performance on six benchmark datasets. The new method often leads to speedups of 10 to 50 times compared to standard LOO error computation. It has good promise for use in hyperparameter tuning and model comparison.

**The needs and benefits of applying textual data mining within the product development process**. With Rakesh Menon, Han Tong Loh, A. Brombacher and C. Leong. Quality and Reliability Engineering International 2004. [abstract]To be inserted.

**A simple and efficient algorithm for gene selection using sparse logistic regression**. With Shirish Shevade. Bioinformatics Journal 2003. [abstract]This paper gives a new and efficient algorithm for the sparse logistic regression problem. The proposed algorithm is based on the Gauss-Seidel method and is asymptotically convergent. It is simple and extremely easy to implement; it neither uses any sophisticated mathematical programming software nor needs any matrix operations. It can be applied to a variety of real-world problems like identifying marker genes and building a classifier in the context of cancer diagnosis using microarray data.

The gene selection method suggested in this paper is demonstrated on four real-world datasets and found to be consistent with the literature or biological knowledge of these sets.

**Asymptotic behaviours of support vector machines with gaussian kernel**. With Chih-Jen Lin. Neural Computation 2003. [abstract]Support vector machines (SVMs) with the Gaussian (RBF) kernel have been popular for practical use. Model selection in this class of SVMs involves two hyperparameters: the penalty parameter, C and the kernel width, sigma. This paper analyzes the behavior of the SVM classifier when these hyperparameters take very small or very large values. Our results help in a good understanding of the hyperparameter space that leads to an efficient heuristic method of searching for hyperparameter values with small generalization errors.

One of the key theoretical results of this paper is that, as sigma goes to infinity the SVM classifier with gaussian kernel tends to the linear SVM classifier. This is an important result since it means that, a gaussian SVM designed to minimize the generalization error over all possible C and sigma has to be equal or better in performance than a linear SVM classifier with C tuned.

**Multi-category classification by soft-max combination of binary classifiers**. With Kaibo Duan, Wei Chu and Aun Neow Poo. MCS 2003. [abstract]In this paper, we propose a multi-category classification method that combines binary classifiers through soft-max function. Posteriori probabilities are also obtained. Both, one-versus-all and one-versus-one classifiers can be used in the combination. Empirical comparison shows that the proposed method is competitive with other implementations of one-versus-all and one-versus-one methods in terms of both classification accuracy and posteriori probability estimate.

**Bayesian Trigonometric support vector classifier**. With Wei Chu and Chong Jin Ong. Neural Computation 2003. [abstract]This paper describes Bayesian techniques for support vector classification. In particular, we propose a novel differentiable loss function called the trigonometric loss function which has the desirable characteristic of natural normalization in the likelihood function, and then follow standard Gaussian processes techniques to set up a Bayesian framework. In this framework, Bayesian inference is used to implement model adaptation, while keeping the merits of support vector classifier, such as sparseness and convex programming. This differs from standard Gaussian processes for classification. Moreover, we put forward class probability in making predictions. Experimental results on benchmark data sets indicate the usefulness of this approach.

For a code, download this directory: bisvm.zip.

**SMO Algorithm for Least Squares SVM Formulations**. With Shirish Shevade. Neural Computation 2003. [abstract]This paper extends the well-known SMO algorithm of Support Vector Machines (SVMs) to Least Squares SVM formulations which include LS-SVM classification, Kernel Ridge Regression and a particular form of regularized Kernel Fisher Discriminant. The algorithm is shown to be asymptotically convergent. It is also extremely easy to implement. Computational experiments show that the algorithm is fast and scales efficiently (quadratically) as a function of the number of examples.

**Evaluation of simple performance measures for tuning SVM hyperparameters**. With Kaibo Duan and Aun Neow Poo. Neurocomputing 2003. [abstract]Choosing optimal hyperparameter values for support vector machines is an important step in SVM design. This is usually done by minimizing either an estimate of generalization error or some other related performance measure. In this paper, we empirically study the usefulness of several simple performance measures that are inexpensive to compute (in the sense that they do not require expensive matrix operations involving the kernel matrix). The results point out which of these measures are adequate functionals for tuning SVM hyperparameters. For SVMs with L1 soft margin formulation, none of the simple measures yields a performance uniformly as good as k-fold cross validation; Joachims' Xi-Alpha bound and Wahba et al's GACV come next and perform reasonably well. For SVMs with L2 soft margin formulation, the radius margin bound gives a very good prediction of optimal hyperparameter values.

Follow this link for details: comparison.html.

**Thermal error measurement and modelling in machine tools. Part II. Hybrid Bayesian Network - support vector machine model**. With R. Ramesh, M.A. Mannan and Aun Neow Poo. International Journal of Machine Tools and Manufacture 2003. [abstract]To be inserted.

**A Machine Learning Approach for the Curation of Biomedical Literature -- KDD Cup 2002 (Task 1)**. [abstract]In collaboration with DTI Institute and Genome Institute of Singapore our Datamining group took part in Task 1 (Information extraction from Biomedical articles) of KDD Cup 2002 and got a honourable mention. Click here for a short article written for the university monthly magazine. For a more detailed article that was published in

*SIGKDD Explorations*click here.**Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms**. IEEE Transactions on Neural Networks 2002. [abstract]In this short report I discuss implementation issues related to the tuning of the hyperparameters of a Support Vector Machine (SVM) with L2 soft margin, for which the radius/margin bound is taken as the index to be minimized and iterative techniques are employed for computing radius and margin. The implementation is shown to be feasible and efficient even for large problems having more than 10,000 support vectors.

A report discussing the implementation issues, together with an old code running on Matlab 5.3 interface, can be downloaded from the following page: nparm.html.

**Convergence of a generalized SMO algorithm for SVM classifier design**. Machine Learning Journal 2002. [abstract]Convergence of a generalized version of the modified SMO algorithms given by Keerthi et al for SVM classifier design is proved. The convergence results are also extended to modified SMO algorithms for solving nu-SVM classifier problems.

There is a slight flaw in the final argument in the convergence proof of the above paper. This has been corrected by Takahashi and Nishi (IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp. 774-776, May 2005).

**A Fast Dual Algorithm for Kernel Logistic Regression**. With Kaibo Duan, Shirish Shevade and Aun Neow Poo. ICML 2002. [abstract]This paper gives a new iterative algorithm for kernel logistic regression. It is based on the solution of the dual problem using ideas similar to those of the SMO algorithm for Support Vector Machines. Asymptotic convergence of the algorithm is proved. Computational experiments show that the algorithm is robust and fast. The algorithmic ideas can also be used to give a fast dual algorithm for solving the optimization problem arising in the inner loop of Gaussian Process classifiers.

**A unified loss function in Bayesian framework for support vector regression**. With Wei Chu and Chong Jin Ong. ICML 2001. [abstract]In this paper we use a unified loss function, called the soft insensitive loss function, for Bayesian support vector regression. We follow standard Gaussian processes for regression to set up the Bayesian framework, in which the unified loss function is used in the likelihood evaluation. Under this framework, the maximum a posteriori estimate of the function values corresponds to the solution of an extended support vector regression problem. The overall approach has the merits of support vector regression such as convex quadratic programming and sparsity in solution representation. It also has the advantages of Bayesian methods for model adaptation and error bars for its predictions. Experimental results on simulated and real-world data sets indicate that the approach works well even on large data sets.

**Evaluation of simple performance measures for tuning SVM hyperparameters**. With Kaibo Duan and Aun Neow Poo. ICONIP 2001. [abstract]Choosing optimal hyperparameter values for support vector machines is an important step in SVM design. This is usually done by minimizing either an estimate of generalization error or some other related performance measure. In this paper, we empirically study the usefulness of several simple performance measures that are inexpensive to compute (in the sense that they do not require expensive matrix operations involving the kernel matrix). The results point out which of these measures are adequate functionals for tuning SVM hyperparameters. For SVMs with L1 soft margin formulation, none of the simple measures yields a performance uniformly as good as k-fold cross validation; Joachims' Xi-Alpha bound and Wahba et al's GACV come next and perform reasonably well. For SVMs with L2 soft margin formulation, the radius margin bound gives a very good prediction of optimal hyperparameter values.

Follow this link for details: comparison.html.

**Mean-field methods for a special class of belief networks**. With Chiranjib Bhattacharyya. JAIR 2001. [abstract]The chief aim of this paper is to propose mean-field approximations for a broad class of Belief networks, of which sigmoid and noisy-or networks can be seen as special cases. The approximations are based on a powerful mean-field theory suggested by Plefka. We show that Saul, Jaakkola and Jordan' s approach is the first order approximation in Plefka's approach, via a variational derivation. The application of Plefka's theory to belief networks is not computationally tractable. To tackle this problem we propose new approximations based on Taylor series. Small scale experiments show that the proposed schemes are attractive.

**Computation of a penetration measure between 3D convex polyhedral objects for collision detection**. With K. Sridharan. Journal of Robotic Systems 2001. [abstract]To be inserted.

**Improvements to Platt's SMO algorithm for SVM classifier design**. With Shirish Shevade, Chiranjib Bhattacharyya and K.R. Krishna Murthy. Neural Computation 2001. [abstract]This paper points out an important source of confusion and inefficiency in Platt's Sequential Minimal Optimization (SMO) algorithm that is caused by the use of a single threshold value. Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO. These modified algorithms perform significantly faster than the original SMO on all benchmark datasets tried.

The modified SMO algorithm described in the above paper has been popularly used. For example, the LIBSVM code uses a variation of this algorithm.

Click here for a postscript file containing the original technical report with pseudocode.

**Predictive approaches for choosing hyperparameters in Gaussian processes**. With S. Sundararajan. Neural Computation 2001. [abstract]Gaussian Processes are powerful non-parametric models specified by parameterized mean and covariance functions. For regression they are simpler and better than Neural Networks. Standard approaches to choose hyperparameters in Gaussian Processes are Maximum Likelihood (ML) and Maximum APosterior (MAP) approaches. Recently we have proposed and investigated predictive approaches based on Geisser's Predictive Sample Reuse (PSR) methodology and the related Stone's Cross Validation (CV) methodology. More specifically, we have derived results for Geisser's surrogate Predictive Probability (GPP), Geisser's Predictive mean square Error (GPE) and the standard CV error and have made a comparative study. Within an approximation we have also arrived at the Generalized Cross Validation (GCV) and established its relationship with the GPP and GPE approaches. These approaches have been tested on a number of benchmark problems. Experimental results show that these approaches are strongly competitive to the existing approaches.

**Rule prepending and post-pruning approach to incremental learning of decision lists**. With K.R. Krishna Murthy and M. Narasimha Murty. Pattern Recognition 2001. [abstract]To be inserted.

**Predictive approaches for choosing hyperparameters in Gaussian processes**. With S. Sundararajan. NIPS 2000. [abstract]Gaussian Processes are powerful non-parametric models specified by parameterized mean and covariance functions. For regression they are simpler and better than Neural Networks. Standard approaches to choose hyperparameters in Gaussian Processes are Maximum Likelihood (ML) and Maximum APosterior (MAP) approaches. Recently we have proposed and investigated predictive approaches based on Geisser's Predictive Sample Reuse (PSR) methodology and the related Stone's Cross Validation (CV) methodology. More specifically, we have derived results for Geisser's surrogate Predictive Probability (GPP), Geisser's Predictive mean square Error (GPE) and the standard CV error and have made a comparative study. Within an approximation we have also arrived at the Generalized Cross Validation (GCV) and established its relationship with the GPP and GPE approaches. These approaches have been tested on a number of benchmark problems. Experimental results show that these approaches are strongly competitive to the existing approaches.

**A variational mean-field theory for sigmoidal belief networks**. With Chiranjib Bhattacharyya. NIPS 2000. [abstract]To be inserted.

**A fast iterative nearest point algorithm for support vector machine classifier design**. With Shirish Shevade, Chiranjib Bhattacharyya and K.R. Krishna Murthy. IEEE Transactions on Neural Networks 2000. [abstract]In this report we give a new, fast iterative algorithm for support vector machine (SVM) classifier design. The basic problem treated is one that does not allow classification violations. The problem is converted to a problem of computing the nearest point between two convex polytopes. The suitability of two classical nearest point algorithms, due to Gilbert, and Mitchell, Dem'yanov and Malozemov, is studied. Ideas from both these algorithms are combined and modified to derive our fast algorithm. For problems which require classification violations to be allowed, the violations are quadratically penalized and an idea due to Friess is used to convert it to a problem in which there are no classification violations. Comparative computational evaluation of our algorithm against powerful SVM methods such as Platt's Sequential Minimal Optimization shows that our algorithm is very competitive.

Click here for a postscript file containing the original technical report with pseudocode.

Click here for a fortran code.

**A stochastic connectionist approach for global optimization with application to pattern clustering**. With Phanendra Babu and M. Narasimha Murty. IEEE Transactions on Systems, Man and Cybernetics (Part B) 2000. [abstract]In this paper, a stochastic connectionist approach is proposed for solving function optimization problems with real-valued parameters. With the assumption of increased processing capability of a node in the connectionist network, we show how a broader class of problems can be solved. As the proposed approach is a stochastic search technique, it avoids getting stuck in local optima. Robustness of the approach is demostrated on several multi-modal functions with different numbers of variables. Optimization of a well-known partitional clustering criterion, the squared-error criterion (SEC), is formulated as a function optimization problem and is solved using the proposed approach. This approach is used to cluster selected data sets and the results obtained are compared with that of the K-means algorithm and a simulated annealing (SA) approach. The amenability of the connectionist approach to parallelization enables effective use of parallel hardware.

**Improvements to the SMO algorithm for SVM regression**. With Shirish Shevade, Chiranjib Bhattacharyya and K.R. Krishna Murthy. IEEE Transactions on Neural Networks 2000. [abstract]This paper points out an important source of inefficiency in Smola and Scholkopf's Sequential Minimal Optimization (SMO) algorithm for SVM regression that is caused by the use of a single threshold value. Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO for regression. These modified algorithms perform significantly faster than the original SMO on the datasets tried.

**Information geometry and Plefka's mean-field theory**. With Chiranjib Bhattacharyya. Journal of Physics A: Math. Gen. 2000. [abstract]An alternate derivation of Plefka's method is presented which is obtained from minimizing Gibbs energy. It is shown that this method can be rederived using a perturbation expansion of Kullback-Leibler divergence, the latter having a nice information geometric interpretation.

**Optimal control of a somersaulting platform diver: a numerical approach**. With Murthy Nukala, IEEE Conference on Robotics and Automation, 1993. [abstract]The somersaulting maneuver of a platform diver is studied. An effective numerical approach for obtaining an optimal solution for this motion is given. The diver is modeled as a planar system of interconnected multibodies, and controllability is proved in a sense dictated by the problem. A time-optimal control problem with state and control constraints is set up and solved using a numerical approach. The numerical solution agrees well with motions executed by professional divers.

**A fast procedure for computing the distance between complex objects in 3-dimensional space**. With Elmer Gilbert and Dan Johnson. IEEE Journal of Robotics and Automation, 1988.*The algorithm in this paper, The GJK Algorithm has great use in video games, computer graphics, robotics motionplanning etc. See this video.. the presenter seems to face great difficulty pronouncing my name!*[abstract]An algorithm for computing the Euclidean distance between a pair of convex sets in Rm is described. Extensive numerical experience with a broad family of polytopes in R3 shows that the computational cost is approximately linear in the total number of vertices specifying the two polytopes. The algorithm has special features which makes its application in a variety of robotics problems attractive. These features are discussed and an example of collision detection is given.

**Optimal infinite-horizon feedback laws for a general class of constrained discrete-time systems: stability and moving-horizon approximations**. With Elmer Gilbert. Journal of Optimization Theory and Its Applications, 1988. This is a highly cited paper in the area of Model Predictive Control (MPC). See this survey paper. [abstract]Stability results are given for a class of feedback systems arising from the regulation of time-varying discrete-time systems using optimal infinite-horizon and moving-horizon feedback laws. The class is characterized by joint constraints on the state and the control, a general nonlinear cost function and nonlinear equations of motion possessing two special properties. It is shown that weak conditions on the cost function and the constraints are sufficient to guarantee uniform asymptotic stability of both the infinite-horizon and moving-horizon feedback systems. The infinite-horizon cost associated with the moving-horizon feedback law approaches the optimal infinite-horizon cost as the moving-horizon is extended.

*Last updated: May, 2017*