Spark library for generalized K-Means clustering. Supports general Bregman divergences. Suitable for clustering probabilistic data, time series data, high-dimensional data, and very large data.

derrickburns/generalized-kmeans-clustering

Generalized K-Means Clustering

This project generalizes the Spark MLlib batch K-Means (v1.1.0) clusterer and the Spark MLlib streaming K-Means (v1.2.0) clusterer. Most practical variants of K-means clustering are implemented or can be implemented with this package, including:

If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant!

This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!
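
To make the phrase "general Bregman divergences" concrete, here is a minimal sketch in plain Python of what a Bregman divergence is and how it can replace squared Euclidean distance in the assignment step of K-means. It is not this package's API: every name below (`bregman_divergence`, `closest_center`, and the helper functions) is hypothetical and chosen only for illustration. A Bregman divergence is defined from a convex function F as D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bregman_divergence(
    f: Callable[[Vector], float],
    grad_f: Callable[[Vector], List[float]],
    p: Vector,
    q: Vector,
) -> float:
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q> for a convex F."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))
    return f(p) - f(q) - inner


# F(x) = ||x||^2 yields the squared Euclidean distance of classic K-means.
def sq_norm(x: Vector) -> float:
    return sum(xi * xi for xi in x)


def sq_norm_grad(x: Vector) -> List[float]:
    return [2.0 * xi for xi in x]


# F(x) = sum x_i log x_i yields the (generalized) Kullback-Leibler divergence.
# Assumes strictly positive components.
def neg_entropy(x: Vector) -> float:
    return sum(xi * math.log(xi) for xi in x)


def neg_entropy_grad(x: Vector) -> List[float]:
    return [math.log(xi) + 1.0 for xi in x]


def squared_euclidean(p: Vector, q: Vector) -> float:
    return bregman_divergence(sq_norm, sq_norm_grad, p, q)


def kl_divergence(p: Vector, q: Vector) -> float:
    return bregman_divergence(neg_entropy, neg_entropy_grad, p, q)


def closest_center(
    point: Vector,
    centers: Sequence[Vector],
    divergence: Callable[[Vector, Vector], float],
) -> int:
    """Assignment step: index of the center with the smallest divergence."""
    return min(range(len(centers)), key=lambda i: divergence(point, centers[i]))
```

Swapping `squared_euclidean` for `kl_divergence` in `closest_center` changes the clustering geometry without changing the assignment logic, which is the sense in which a single clusterer can be parameterized by the divergence.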