SVMlin 
  Fast  Linear  SVM  Solvers  for Supervised  and Semi-supervised  Learning


Author:
Vikas Sindhwani 

Department of Computer Science
University of Chicago


DESCRIPTION

SVMlin is a software package for linear SVMs. It is well-suited to classification problems involving a large number of examples and features. It is primarily written for sparse datasets, where the number of non-zero features in an example is typically small. It is written in C++ (mostly C-style code). A mex wrapper is available for MATLAB users.

SVMlin can also utilize unlabeled data in addition to labeled examples. It currently implements two extensions of standard SVMs that incorporate unlabeled examples.

For a Reuters text categorization problem with 804414 labeled examples and 47236 features, SVMlin takes less than two minutes to train a linear SVM on an Intel machine with a 3GHz processor and 2GB RAM. Given just 1000 labels, it can utilize the remaining hundreds of thousands of unlabeled examples to train a semi-supervised linear SVM in about 20 minutes. Unlabeled data can be very useful in improving classification performance when labels are relatively few.

SVMlin  implements the following algorithms (See references [1],[2]):

Fully supervised [using only labeled examples]
 
* Linear  Regularized Least Squares (RLS) Classification
* Modified Finite Newton  Linear L2-SVMs

Semi-supervised [can use unlabeled data as well]

* Linear Transductive L2-SVMs  with multiple switchings
* Deterministic Annealing (DA) for Semi-supervised Linear L2-SVMs
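For intuition about the simplest of these, linear RLS minimizes a squared-loss objective whose minimizer has a closed form: w = (X'X + lambda*n*I)^(-1) X'y. The following is a minimal dense NumPy sketch of that formula on made-up toy data; it is only an illustration, not how SVMlin itself solves the problem (SVMlin works on large sparse data).

```python
import numpy as np

def rls_train(X, y, lam=1.0):
    """Closed-form regularized least squares:
    minimize (1/n) * ||X w - y||^2 + lam * ||w||^2,
    whose solution is w = (X'X + lam * n * I)^{-1} X'y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# Toy separable problem: the label is the sign of the first feature.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = rls_train(X, y, lam=0.01)
# sign(X @ w) recovers y on this toy problem
```

Classification with the learnt weight vector is then just thresholding the real-valued output x'w at zero, which is also how the L2-SVM variants above produce predictions.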

DOWNLOAD  SVMlin

The current release [v1.0 -- August 2006] is available for download in the following formats (note: SVMlin needs updates based on recent user feedback):

  svmlin-v1.0.zip 
  svmlin-v1.0.tar.gz

Disclaimer: This software is provided without any expressed or implied warranty. In particular, there is no warranty of any kind concerning the fitness of this software for any particular purpose.

Conditions for Use: This software is freely available for educational or commercial purposes. We expect that all publications describing work using this software cite at least one of the references given below.


INSTALLATION

Unpack the files by typing
     unzip svmlin-v1.0.zip
or
     tar -xvzf svmlin-v1.0.tar.gz

This will create a directory called svmlin-v1.0 containing a Makefile and 3 source files: ssl.h, ssl.cpp and svmlin.cpp. 

Type
     make
This will create an executable
       svmlin
This single executable is used for training, testing and performance evaluation.


HOW TO USE

Data Files
 
Input examples (for training and testing) need to be in the SVM-Light/LIBSVM format [except that there is no first column for labels].

Each line describes one example as a list of <feature index>:<value> pairs for its non-zero features, separated by whitespace. The line should end with a newline '\n' character.

     <feature>:<value> <feature>:<value> ... <feature>:<value>

For example, the following data matrix with 4 examples and 5 features
0  3  0  0  1
4  1  0  0  0
0  5  9  2  0
6  0  0  5  3

is described in the input file as

2:3 5:1
1:4 2:1
2:5 3:9 4:2
1:6 4:5 5:3


The file containing labels is separate since it is routine to use the same inputs with different labels. Each line should contain a label for the corresponding line in the input file with one of the following values: 
+1 (labeled positive example)
-1  (labeled negative example)
0   (unlabeled example)
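For instance, a labels file for the 4-example matrix above might look as follows (this particular label assignment is made up for illustration), marking the first two examples as labeled and the last two as unlabeled:

```
+1
-1
0
0
```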

The implementation can only handle binary classification in the current version.


Training

Type
	svmlin [options] training_examples training_labels

The learnt weight vector is saved in training_examples.weights.
The outputs of the weight vector on the training examples are saved in training_examples.outputs.

Testing

Type
	svmlin -f training_examples.weights test_examples

Test examples should be in the same format as the training examples.
The weight vector in training_examples.weights is used for predictions.
The predicted soft outputs are saved in test_examples.outputs.

Evaluation
 
If the labels for the test examples are known, typing
       svmlin -f weights_filename test_examples_filename test_labels_filename
will also output the accuracy. 
You can use the PERF software to compute other performance metrics.
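Since the saved soft outputs are real-valued, the predicted label for an example is simply the sign of its output. A small Python sketch of computing accuracy from an outputs file and a labels file (not part of SVMlin; the file names in the commented usage line follow the naming convention above):

```python
def accuracy(output_lines, label_lines):
    """Fraction of examples where sign(soft output) matches the +1/-1 label."""
    preds = [float(s) > 0 for s in output_lines]
    truth = [float(s) > 0 for s in label_lines]
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

# Hypothetical usage on SVMlin's output files:
#   acc = accuracy(open("test_examples.outputs"), open("test_labels"))

acc = accuracy(["0.7", "-1.2", "0.3", "-0.1"], ["+1", "-1", "-1", "-1"])
# acc == 0.75 (three of the four signs agree)
```

For metrics beyond accuracy (precision, recall, AUC, etc.), feed the same outputs and labels to the PERF software.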


Options
 

Typing
      svmlin 
 without arguments produces the following help menu.

Usage
Train: svmlin [options] training_examples training_labels
Test: svmlin -f weights_file test_examples
Evaluate: svmlin -f weights_file test_examples test_labels

Options:
-A algorithm : set algorithm (default 1)
0 -- Regularized Least Squares (RLS) Classification
1 -- SVM (L2-SVM-MFN) (default choice)
2 -- Multi-switch Transductive SVM (using L2-SVM-MFN)
3 -- Deterministic Annealing Semi-supervised SVM (using L2-SVM-MFN)
-W regularization parameter lambda (default 1)
-U regularization parameter lambda_u (default 1)
-S maximum number of switches in TSVM (default 0.5*number of unlabeled examples)
-R positive class fraction of unlabeled data (0.5)
-f weights_filename (Test Mode: input_filename is the test file)
-Cp relative cost for positive examples (only available with -A 1)
-Cn relative cost for negative examples (only available with -A 1)

Example  Files

A subdirectory called example contains the following files: training_examples, training_labels, test_examples, test_labels.
training_labels specifies 50 labels among 1460 examples (unlabeled examples are labeled "0").

To train the multi-switch Transductive SVM, run
      svmlin -A 2 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels
This will create the files training_examples.weights and training_examples.outputs in the current directory. To evaluate performance on the test set, run
     svmlin -f training_examples.weights example/test_examples example/test_labels

On this dataset (a subset of 20 newsgroups), the multi-switch TSVM gives an accuracy of 92.8%. To run the deterministic annealing semi-supervised SVM, run
     svmlin -A 3 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels
Evaluating performance as above, we see that deterministic annealing gives an accuracy of 95.5%. While deterministic annealing can give
performance improvements on some datasets, the multi-switch TSVM is generally faster and often has comparable performance.

The options -A 0 and -A 1 ignore the unlabeled examples (if any) and perform supervised learning.


MATLAB  Interface

SVMlin can be run from MATLAB using the mex interface implemented in svmlin_mex.cpp and svmlin.m.
Download these files into the directory svmlin-v1.0, invoke MATLAB, and compile the mex file with
   mex svmlin_mex.cpp ssl.o
where ssl.o is the object file compiled from ssl.cpp.
To see usage information in MATLAB, type
         help svmlin


RESEARCH ON LARGE SCALE SEMI-SUPERVISED LEARNING


Datasets


Name         examples    features    average sparsity*    original source
----------   --------    --------    -----------------    ---------------
aut-avn         71175       20707         51.32           SRAA
real-sim        72309       20958         51.32           SRAA
ccat            23119       47236         75.93           RCV1-v2
gcat            23119       47236         75.93           RCV1-v2
33-36           82692       59072         26.56           Yahoo!
pcmac            1946        7511         54.58           20 newsgroups
full-ccat      804414       47236         76.7            RCV1-v2
full-gcat      804414       47236         76.7            RCV1-v2
kdd99         4898431         127         17.3            KDD Cup 1999
SecStr        1273151         315         15              SSL Book

* average number of non-zero features per example

MATLAB files for the first 6 datasets (except the 33-36 Yahoo! dataset) can be downloaded from here. To retrieve the
experimental setting (data splits) and reproduce the results of references [1],[2], use the MATLAB file Experiments.m.
The same datasets in SVM-Light format are available here.



REFERENCES

[1] Vikas Sindhwani and Sathiya Keerthi. Large Scale Semi-supervised Linear SVMs. Proceedings of ACM SIGIR, 2006.
[2] Vikas Sindhwani and Sathiya Keerthi. Newton Methods for Fast Solution of Semi-supervised Linear SVMs. Book chapter in Large Scale Kernel Machines, MIT Press, 2006.

HISTORY

v1.0 -- August 2006 release

COMMENTS/BUG REPORTS

Please send me an email at vikass at cs dot uchicago dot edu.

ACKNOWLEDGMENTS

I thank Partha Niyogi and Sathiya Keerthi for support.
Sumeet Chawla and Chih-Jen Lin provided helpful comments.
The organization of the code was inspired by LIBSVM and SVM-Light.