SparseLOGREG: A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression

This document describes SparseLOGREG, an efficient implementation of the sparse logistic regression algorithm presented in the following technical report:

S. K. Shevade and S. S. Keerthi (2002), A Simple and Efficient Algorithm for Gene Selection using Sparse Logistic Regression, Technical Report No. CD-02-22, Control Division, Department of Mechanical Engineering, National University of Singapore, Singapore - 117 576.

A slightly modified version of the above report will appear as an article in the Bioinformatics journal.
A PostScript file containing the Supplementary Information associated with the above journal paper is also available for download.

It is a good idea to read the report mentioned above before using the program described below.

How to use SparseLOGREG

Download SparseLOGREG.tar.gz. Unzip and untar this file.

 
  gunzip SparseLOGREG.tar.gz 
  tar xvf  SparseLOGREG.tar

Extracting the archive creates a directory called SparseLOGREG, which contains two sub-directories, bin and datasets. The former contains the source programs, while the latter contains some sample datasets.

There are two main source files, "FindCounts.c" and "FindGenes.c", in the sub-directory bin. Build the corresponding executables, FindCounts and FindGenes, by running the following commands:

 cd SparseLOGREG/bin
 make all

If this doesn't work, you may have to edit the Makefile in the bin directory to adjust the compiler settings.

Note that some of the source files (nrutil.c, nrutil.h, ran1.c, and sort.c) are taken from the Numerical Recipes in C software library. SparseLOGREG uses these minor routines for memory allocation and deallocation, random number generation, and sorting.

Input Specification:

Both executables read their input from a file named "in.txt". The syntax of this file is given below. Every line in this file begins with a string (containing no blank characters) followed by the actual input values. The inputs must be specified in the order given below.

InputDataFile

Specifies the input file that contains the training samples. The data file should be in ASCII format; every row in the file represents one training example, while the columns represent features. Every row ends with a class label (+1 or -1). All entries in this file are separated by blanks or tabs.
Normalization: It is a good idea to normalize the input to zero mean and unit variance, first for each sample (over all features) and then for each feature (over all samples). This step yields good results for gene microarray data.
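
The two-stage normalization described above can be sketched as a small shell function built on awk. This is only an illustration and is not part of SparseLOGREG: the name normalize_data is made up here, the function assumes the data file format described above (features followed by a class label in the last column), and it divides by the count when computing the variance, which is an assumption because the text does not say which variance estimate is intended. A row or column with zero variance is set to zero.

```shell
# Hypothetical helper, not part of SparseLOGREG.
# Usage: normalize_data raw.txt > normalized.txt
normalize_data() {
  awk '
  NF < 2 { next }                   # skip blank or label-only lines
  {
    rows++
    n = NF - 1                      # last column is the class label
    label[rows] = $NF
    # Stage 1: standardize this sample over all of its features.
    sum = 0
    for (j = 1; j <= n; j++) sum += $j
    mean = sum / n
    ss = 0
    for (j = 1; j <= n; j++) ss += ($j - mean) * ($j - mean)
    sd = sqrt(ss / n)               # population standard deviation (assumption)
    for (j = 1; j <= n; j++) x[rows, j] = (sd > 0) ? ($j - mean) / sd : 0
  }
  END {
    # Stage 2: standardize each feature over all samples.
    for (j = 1; j <= n; j++) {
      sum = 0
      for (i = 1; i <= rows; i++) sum += x[i, j]
      mean = sum / rows
      ss = 0
      for (i = 1; i <= rows; i++) ss += (x[i, j] - mean) * (x[i, j] - mean)
      sd = sqrt(ss / rows)
      for (i = 1; i <= rows; i++) x[i, j] = (sd > 0) ? (x[i, j] - mean) / sd : 0
    }
    # Write the rows back out, label last, separated by single blanks.
    for (i = 1; i <= rows; i++) {
      for (j = 1; j <= n; j++) printf "%.6f ", x[i, j]
      print label[i]
    }
  }' "$1"
}
```

The output keeps one example per row with the class label in the last column, so it can be named as the InputDataFile directly.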

NoOfExamples

Denotes the number of training examples in the dataset.

InputDimension

Represents the number of features characterizing each sample.

NoOfExpts

Indicates the number of times the cross-validation experiments are repeated.
A good choice for this variable is 100.

Gamma

Specifies the range of gamma values to be tried during cross-validation. For example, the input line,

gamma .01 4 5

indicates that 5 different values of gamma in the interval [.01, 4] will be used by the program for cross-validation purposes.
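
To get a feel for what a line such as "gamma .01 4 5" might expand to, the sketch below prints a grid of candidate values. The text above does not say how the values are spaced inside the interval, so the evenly spaced (linear) grid used here is purely an assumption, and gamma_grid is a hypothetical helper rather than part of SparseLOGREG. Consult the technical report or the source code for the spacing the program actually uses.

```shell
# Hypothetical helper: print N evenly spaced values in [LOW, HIGH].
# Linear spacing is an assumption; the program may space gamma differently.
# Usage: gamma_grid LOW HIGH N
gamma_grid() {
  awk -v lo="$1" -v hi="$2" -v n="$3" 'BEGIN {
    if (n < 2) { print lo + 0; exit }
    step = (hi - lo) / (n - 1)
    for (i = 0; i < n; i++) print lo + i * step
  }'
}
```

For instance, gamma_grid .01 4 5 prints five values, one per line, starting at the lower end of the interval and ending at the upper end.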

InternalKfold

Indicates the number of folds used for cross-validation. A good choice for this number is 3.

Tol

Specifies the tolerance; 0.000001 is a good choice.

FeaturesFile

Specifies the name of the file in which the ranked features are stored. FindCounts writes this file and FindGenes reads it. FindCounts ranks the features by relevance count and stores them in descending order of that count; FindGenes then uses the same file to decide the final classifier.

The first line of this file gives the number of features with nonzero relevance count. It is followed by each feature number and its relevance count, arranged in descending order of relevance count.

We recommend not altering this file, since it is the input for the program FindGenes.

Users who only need the feature ranking, for example to use the top-ranked features for some other purpose, can run FindCounts alone and extract the required number of features from FeaturesFile.
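
Extracting the top-ranked features can be done without touching the file itself. The sketch below is a hypothetical helper, not part of SparseLOGREG. It assumes the layout described above, with the count on the first line and then one "feature number, relevance count" pair per line separated by blanks or tabs; the one-pair-per-line layout is an assumption, so check it against a file actually produced by FindCounts.

```shell
# Hypothetical helper, not part of SparseLOGREG.
# Assumes line 1 holds the number of ranked features and every later line
# holds "feature_number relevance_count", already sorted by FindCounts.
# Usage: top_features FEATURES_FILE K
top_features() {
  awk -v k="$2" 'NR > 1 && NR <= k + 1 { print $1 }' "$1"
}
```

The helper only reads FeaturesFile and prints the first K feature numbers, one per line, so the file stays intact for FindGenes.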

ClassifierFile

Specifies the name of the output file in which the program FindGenes stores the final classifier model. This input is not used by the program FindCounts.

The file gives the following: the average validation error for every feature added; and the final classifier design, with the feature number followed by the corresponding weight, and the value of the final bias term. See the technical report for the details.

MaxNoOfFeatures

Denotes the maximum number of features to include in the final classifier. We found that 20 is a good choice for this input. This input is not used by the program FindCounts.

Note that the file "in.txt" must reside in the directory from which the commands FindCounts and FindGenes are executed. A sample of this file is provided in the SparseLOGREG/datasets/colon directory.
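
As an illustration of the overall layout, the sketch below writes an example "in.txt" with the ten inputs in the order described above. It is not a copy of the shipped sample: write_example_input is a hypothetical helper, the file names and dataset sizes are placeholders, and the leading label on each line is simply taken from the headings in this document (with the gamma line copied from the example above). The sample in SparseLOGREG/datasets/colon remains the authoritative reference for the exact form.

```shell
# Hypothetical helper: write an example in.txt into directory DIR.
# All file names and dataset sizes below are placeholders, not shipped values.
# Usage: write_example_input DIR
write_example_input() {
  cat > "$1/in.txt" <<'EOF'
InputDataFile mydata.txt
NoOfExamples 40
InputDimension 500
NoOfExpts 100
gamma .01 4 5
InternalKfold 3
Tol 0.000001
FeaturesFile features.txt
ClassifierFile classifier.txt
MaxNoOfFeatures 20
EOF
}
```

Calling write_example_input . from the directory that holds the data file places "in.txt" where FindCounts and FindGenes will look for it; NoOfExamples and InputDimension must then be edited to match the actual data.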

Demo:

Execute the programs from the SparseLOGREG/datasets/colon directory by typing the following commands:

  ../../bin/FindCounts 
  ../../bin/FindGenes

This should produce the FeaturesFile, which contains the ranked features, and the ClassifierFile, which contains the final classifier.

This software is made available free of charge for non-commercial use only. Please cite the technical report referenced above if you use this program.

If you encounter any problems with this software, send me an e-mail.