-
Notifications
You must be signed in to change notification settings - Fork 9
/
README
106 lines (82 loc) · 3.72 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
===============================
Fast K-means Clustering Toolkit
===============================
----------------------
Version 0.1 (Sat May 17 17:41:11 CDT 2014)
- Initial release.
----------------------
WHAT:
This software is a testbed for comparing variants of Lloyd's k-means clustering
algorithm. It includes implementations of several algorithms that accelerate
the algorithm by avoiding unnecessary distance calculations.
----------------------
WHO:
Greg Hamerly ([email protected], primary contact) and Jonathan Drake
----------------------
HOW TO BUILD THE SOFTWARE:
type "make" (and hope for the best)
----------------------
HOW TO RUN THE SOFTWARE:
The driver is designed to take commands from standard input, usually a file
that's been redirected as input:
./kmeans < commands.txt
You can read the source to find all the possible commands, but here is a
summary:
- threads T -- use T threads for clustering
- maxiterations I -- use at most I iterations; default (or negative)
indicates an unlimited number
- dataset D -- use the given path name to a file as the dataset for
clustering. The dataset should have a first line with the number of points
n and dimension d. The next (nd) tokens are taken as the n vectors
to cluster.
- initialize k {kpp|random} -- use the given method (k-means++ or a random
sample of the points) to initialize k centers
- lloyd, hamerly, annulus, elkan, compare, sort, heap, adaptive -- perform
k-means clustering with the given algorithm (requires first having
initialized the centers). The adaptive algorithm is Drake's algorithm with
a heuristic for choosing an initial B
- drake B -- use Drake's algorithm with B lower bounds
- kernel [gaussian T | linear | polynomial P] -- use kernelized k-means with
the given kernel
- elkan_kernel [gaussian T | linear | polynomial P] -- use kernelized
k-means with the given kernel, and Elkan's accelerations
- center -- give the previously-loaded dataset a mean of 0.
- quit -- quit the program
Note that when a set of centers is initialized, that same set of centers is used
from then on (until a new initialization occurs). So running a clustering
algorithm multiple times will use the same initialization each time.
Here is an example of a simple set of commands:
dataset smallDataset.txt
initialize 10 kpp
annulus
hamerly
adaptive
heap
elkan
sort
compare
----------------------
CAVEATS:
- This software has been developed and tested on Linux. Other platforms may not
work. Please let us know if you have difficulties, and if possible fixes for
the code.
- This software uses a non-standard pthreads function called
pthread_barrier_wait(), which is implemented on Linux but not on OSX.
Therefore, multithreading doesn't currently work on OSX. To turn it off,
comment out the lines in the Makefile that say:
CPPFLAGS += -DUSE_THREADS
LDFLAGS += -lpthread
----------------------
REFERENCES:
Phillips, Steven J. "Acceleration of k-means and related clustering algorithms."
In Algorithm Engineering and Experiments, pp. 166-177. Springer Berlin
Heidelberg, 2002.
Elkan, Charles. "Using the triangle inequality to accelerate k-means." In ICML,
vol. 3, pp. 147-153. 2003.
Hamerly, Greg. "Making k-means Even Faster." In SDM, pp. 130-140. 2010.
Drake, Jonathan, and Greg Hamerly. "Accelerated k-means with adaptive distance
bounds." In 5th NIPS Workshop on Optimization for Machine Learning. 2012.
Drake, Jonathan. "Faster k-means clustering." MS thesis, 2013.
Hamerly, Greg, and Jonathan Drake. "Accelerating Lloyd's algorithm for k-means
clustering." To appear in Partitional Clustering Algorithms, Springer, 2014.