DESCRIPTION
SVMlin is
software package for linear SVMs. It is well-suited to
classification
problems involving a large number of examples and features. It is
primarily written for sparse datasets (number of non-zero
features in an example is typically small). It is written in C++
(mostly C). A mex wrapper is available for MATLAB users.
SVMlin can
also utilize unlabeled data, in addition to labeled examples. It
currently implements two extensions of standard SVMs to incorporate
unlabeled examples.
For a Reuters text categorization
problem with around
804414 labeled examples and 47326 features, SVMlin takes
less than two minutes to train a linear SVM on an Intel machine with
3GHz
processor and 2GB RAM. Given just 1000 labels, it can utilize the
remaining hundreds of thousands of unlabeled examples for training a
semi-supervised linear SVM in about 20 minutes. Unlabeled
data
can be very useful in improving classification performance when labels
are relatively few.
SVMlin implements
the following algorithms (See references
[1],
[2]):
Fully supervised [using only labeled
examples]
* Linear Regularized Least Squares (RLS) Classification
* Modified Finite Newton Linear L2-SVMs
Semi-supervised [can use
unlabeled data as well]
* Linear
Transductive
L2-SVMs with multiple switchings
* Deterministic Annealing (DA) for Semi-supervised Linear L2-SVMs
DOWNLOAD SVMlin
The current release [v1.0 -- August
2006] is available for download in
the following formats (note: SVMlin needs updates based on recent user feedback):
svmlin-v1.0.zip
svmlin-v1.0.tar.gz
Dislaimer:
This software is
provided without any expressed
or implied warranty. In particular, there is no warranty of any
kind concerning the fitness of this software for any particular
purpose.
Conditions for Use: This software
is freely available for educational or commercial purposes. We expect
that
all publications describing work using this software quote at least one
of the references given below.
INSTALLATION
Unpack the files by typing
unzip svmlin.zip
tar -xvzf svmlin.tar.gz
This will create a directory called svmlin-v1.0 containing a Makefile
and
3 source files: ssl.h, ssl.cpp and svmlin.cpp.
Type
make
This will create an executable
svmlin
This single executable is used for training, testing and performance
evaluation.
HOW
TO USE
Data Files
Input examples (for
training and testing) need to be in
the
SVM-Light/
LIBSVM format
[except that there is no first column for labels].
Each line describes an example and is a list of feature index -- value
pairs for non-zero features, separated by white space. The line should
end with a newline '\n' character.
<feature>:<value> <feature>:<value> ...
<feature>:<value>
For example, the following data matrix with 4 examples and 5 features
0 3 0 0 1
4 1 0 0 0
0 5 9 2 0
6 0 0 5 3
is described in the input file as
2:3 5:1
1:4 2:1
2:5 3:9 4:2
1:6 4:5 5:3
The file containing labels is separate since it is routine to use the
same inputs with different labels. Each line should contain a label for
the corresponding line in the input file with one of the following
values:
+1 (labeled positive example)
-1 (labeled negative
example)
0 (unlabeled examples)
The implementation can
only handle binary classification in the current version.
Training
Type
svmlin [options] training_examples training_labels
The learnt weight vector is saved in
training_examples.weights.
The outputs of the weight vector on the training examples are saved in
training_examples.outputs.
Testing
Type
svmlin -f training_examples.weights test_examples_filename
Test examples should be in the same format as the training examples.
The weight vector in training_examples.weights is used for predictions.
The
predicted soft outputs are saved in test_examples.outputs.
Evaluation
If the labels for
the test examples are known, Typing
svmlin -f weights_filename test_examples_filename test_labels_filename
will also output the accuracy.
You can use the
PERF software
to compute other performance metrics.
Options
Typing
svmlin
without arguments produces the following help menu.
Usage
Train: svmlin [options] training_examples training_labels
Test: svmlin -f weights_file test_examples
Evaluate: svmlin -f weights_file test_examples test_labels
Options:
-A algorithm : set algorithm (default 1)
0 -- Regularized Least Squares (RLS) Classification
1 -- SVM (L2-SVM-MFN) (default choice)
2 -- Multi-switch Transductive SVM (using L2-SVM-MFN)
3 -- Deterministic Annealing Semi-supervised SVM (using L2-SVM-MFN)
-W regularization parameter lambda (default 1)
-U regularization parameter lambda_u (default 1)
-S maximum number of switches in TSVM (default 0.5*number of unlabeled examples)
-R positive class fraction of unlabeled data (0.5)
-f weights_filename (Test Mode: input_filename is the test file)
-Cp relative cost for positive examples (only available with -A 1)
-Cn relative cost for negative examples (only available with -A 1)
Example
Files
A subdirectory called
example contains the following
files training_examples,
training_labels, test_examples, test_labels.
training_labels specifies 50 labels among 1460 examples (unlabeled
examples are labeled "0").
For training the multi-switch Transductive SVM run,
svmlin -A 2 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels
This will create files in the current directory called
training_examples.weights, training_examples.outputs. To evaluate
performance on the test set, run,
svmlin -f training_examples.weights example/test_examples example/test_labels
On this dataset (a subset of 20
newgroups), multi-switch TSVM gives an
accuracy of 92.8%.
To run the deterministic annealing semi-supervised SVM, run,
svmlin -A 3 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels
Evaluating performance as above we see that deterministic annealing
gives an accuracy of 95.5%. While deterministic annealing can give
peformance improvements on some datasets, the multi-switch TSVM is
generally faster and can have comparable performance.
The options -A 0 or -A 1 ignores the unlabeled examples (if any) and
performs supervised learning.
MATLAB
Interface
SVMlin can
be run through MATLAB
using the mex interface implemented in svmlin_mex.cpp
and svmlin.m
Download these files in the directory svmlin-v1.0, invoke matlab and
compile the mex file as
mex svmlin_mex.cpp ssl.o
where ssl.o is the object file associated with ssl.cpp.
To see usage under MATLAB type,
help svmlin
RESEARCH
ON LARGE SCALE SEMI-SUPERVISED LEARNING
Datasets
Name
|
examples
|
features
|
average sparsity*
|
original source
|
aut-avn
real-sim
ccat
gcat
33-36
pcmac
full-ccat
full-gcat
kdd99
SecStr
|
71175
72309
23119
23119
82692
1946
804414
804414
4898431
1273151
|
20707
20958
47236
47236
59072
7511
47236
47236
127
315
|
51.32
51.32
75.93
75.93
26.56
54.58
76.7
76.7
17.3
15
|
SRAA
SRAA
RCV1-v2
RCV1-v2
Yahoo!
20
newsgroups
RCV1-v2
RCV1-v2
KDD
Cup 1999
SSL
Book
|
* average number of non-zero features per example
MATLAB files for the first 6 datasets
(except 33-36 Yahoo! dataset) can be downloaded from
here.
To retrieve the
experimental setting (data splits) and reproduce results of the
references
[1],
[2],
use the matlab file
Experiments.m
The same datasets in svm-light format are available
here.
REFERENCES