K-means Clustering

Static Badge Static Badge

K-means clustering is one of the basic algorithms for clustering. In this project you will solve a clustering algorithm using GPS coordinates of locations in Europe. Each location includes a capacity. The locations and capacities are garbage collection facilities. Their capacity is the annual amount of waste accumulated in tonnes. The result of the k-means clustering will be a sub-optimal placement of k facilities that process the accumulated waste. The optimization minimizes the distance between accumulation sites and processing plants. The problem must be generalized to provide such that it solves an arbitrary amount of facilities. More details are given in the implementation guidelines. Also, refer to https://en.wikipedia.org/ wiki/K-means_clustering for details about k-means clustering.

Implementation guidelines

  1. Implementation of the graphical interface:

    • The program should have parameters with which the graphical interface can be toggled to achieve best performance

    • The drawing of the graphics should be done independently of the computational threads.

    • The default size of the window is 800x600px but can be adjusted manually.

    • The graphical interface must use a library to display maps (i.e. openstreetmaps). A good choice would be using a java wrapper for Leaflet.

    • The graphical interface can scale, zoom and move. Initially, it scales to fit all points in the view-port.

  2. Running the program:

    • The program can be ran in different modes (sequential, parallel, distributed) by specifying a parameter.

    • User can specify the number of accumulation sites and the number of clusters.

    • The program measures run-time needed to complete.

  3. Problem specific implementation requirements

    • Every version (sequential, parallel, and distributed) measures cycles passed. Every update of positions of all particles is considered a cycle. All three implementations run the simulation until they reach a specified number of cycles.

    • In case the number of sites is greater then the provided dataset, their positions and capacities must be chosen at random. Capacities should have an upper bound equal to the sample dataset. The positions must be in Europe and even though the equidistant distance can be used.

    • The implementation must adapt automatically to the hardware it is being ran on (Physical CPU's, Cores, Memory, etc..);

Testing

The report must include extensive testing and explanation of results (numeric and graphical). All three versions must be tested. The tests should be performed without drawing graphics. The parameters that influence the runtime are the number of accumulation sites and number of clusters. Consequently, both need to be tested independently to show how the implementation scales. Present the results with informative charts/figures and explain them in detail. The implementation should be tested in the following way:

Supporting material

The suppoting data can be downloaded from HERE

The dataset is a zip-compressed JSON formatted list of waste wood accumulation sites. Each site is of the following structure:

{"name":"Ruhland","capacity":75.25194920635,"la":51.4667,"lo":13.8667},
{"name":"Tettau","capacity":44.38621959541,"la":51.4333,"lo":13.7333},
{"name":"Guteborn","capacity":11.0917739618,"la":51.4167,"lo":13.9333}

Properties "la" and "lo" denote the latitude and longitutde of the location. The capacity is in tonnes of waste wood accumulated anually.