Saturday, February 18, 2012

Minimum spanning tree partitioning algorithm for microaggregation

Authors: Michael Laszlo, Sumitra Mukherjee

Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 17, Issue: 7
Issue date: July 2005



Summary:
The paper proposes a heuristic for clustering a set of points by partitioning the minimum spanning tree (MST) of the underlying complete graph, subject to a constraint on the minimum number of points per cluster, so as to minimize the loss of information. The algorithm indirectly minimizes the within-group sum of squared error (SSE), a common objective of clustering algorithms, through three steps: MST construction, edge cutting, and cluster formation.
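To make the three steps concrete, here is a minimal sketch of the idea in Python, using SciPy for the MST step. The greedy longest-edge-first cutting rule below is an assumption for illustration, not the authors' exact edge-cutting heuristics; k is the minimum cluster size.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_partition(points, k):
    """Toy MST partitioning: build the MST of the complete graph, then
    greedily try to cut the longest edges, keeping a cut only if both
    resulting components retain at least k points. The connected
    components of what remains are the clusters."""
    n = len(points)
    dist = squareform(pdist(points))              # complete-graph edge weights
    mst = minimum_spanning_tree(dist).toarray()
    edges = sorted(((mst[i, j], i, j)             # candidate cuts, longest first
                    for i in range(n) for j in range(n) if mst[i, j] > 0),
                   reverse=True)
    adj = mst + mst.T                             # undirected adjacency
    for w, i, j in edges:
        adj[i, j] = adj[j, i] = 0.0               # tentatively cut the edge
        _, labels = connected_components(adj > 0, directed=False)
        sizes = np.bincount(labels)
        if sizes[labels[i]] < k or sizes[labels[j]] < k:
            adj[i, j] = adj[j, i] = w             # undo: a side got too small
    return connected_components(adj > 0, directed=False)[1]
```

For example, `mst_partition(X, 3)` on an (n, d) array X returns one cluster label per point, with every cluster holding at least 3 points.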
Strengths:
  1. The paper provides a time-complexity analysis for each algorithm, which helps convey the scale of the problem.
  2. The application to microaggregation introduces a new problem to the ML community: instead of specifying the number of clusters, the constraint is on the minimum number of points per cluster, which indirectly controls the number of clusters.
  3. The heuristics are modular, so integration with any other method is straightforward, and all the subroutines rely on established algorithms (MST construction) and data structures (priority queues); see the sketch after this list.
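On point 3: the MST-plus-priority-queue machinery is indeed textbook material. As a generic illustration (not the paper's implementation), here is Prim's algorithm over a dense distance matrix using Python's heapq:

```python
import heapq

def prim_mst(dist):
    """Prim's algorithm on a dense symmetric distance matrix, driven by
    a binary heap; returns the MST as a list of (weight, u, v) edges."""
    n = len(dist)
    in_tree = [False] * n
    in_tree[0] = True
    heap = [(dist[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    mst_edges = []
    while heap and len(mst_edges) < n - 1:
        w, u, v = heapq.heappop(heap)
        if in_tree[v]:
            continue                       # stale heap entry: v already joined
        in_tree[v] = True
        mst_edges.append((w, u, v))
        for j in range(n):                 # push edges out of the new vertex
            if not in_tree[j]:
                heapq.heappush(heap, (dist[v][j], v, j))
    return mst_edges
```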
Weaknesses:
  1. While the algorithm presents a new flavor of approaching the problem from a tree/graph-partitioning perspective, it compromises severely on time complexity by introducing the MST construction step. The last two steps are clearly present, in some form, in most clustering algorithms. The authors argue that their approach performs well when the points have well-defined inherent clusters, but fail to see that an iterative, simple-to-implement k-means, with appropriate modifications, can do the same.
  2. Some of the reported running times are quadratic in the number of data points, which clearly makes this algorithm unsuitable for large real-world datasets.
  3. Performance (information loss) is good only for datasets with well-separated clusters, but in practice such well-defined clusters rarely occur, and as the authors point out, simple algorithms like D or C achieve low information loss on arbitrarily shaped clouds of points. (The loss measure is written out after this list.)
  4. The SSE criterion is best suited for Gaussian-like clouds of points; for other types of clusters, the authors fail to discuss alternative objective criteria. It is also unclear what types of clusters occur in datasets like Creta discussed in the paper.
  5. While Euclidean distance is the most common measure, other distance measures might be better suited to the microaggregation problem.
  6. As the authors themselves point out, there is no upper bound on the group size, and they are forced to resort to a combination of fixed-size methods with MST partitioning to control it.
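For reference on points 3 and 4, information-loss figures in the microaggregation literature are conventionally the within-group sum of squares normalized by the total sum of squares (often reported as a percentage), and to the best of my recollection this paper follows the same convention:

```latex
SSE = \sum_{j=1}^{m} \sum_{x_i \in G_j} \lVert x_i - \bar{x}_j \rVert^2 ,
\qquad
SST = \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2 ,
\qquad
IL = \frac{SSE}{SST}
```

where G_1, ..., G_m are the groups, \bar{x}_j is the centroid of G_j, and \bar{x} is the global centroid.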
Next steps:
  1. When there are clearly separated clusters in the dataset, a simple clustering algorithm can be used to capture them, and the techniques discussed in the paper can then be applied as a follow-up step to control k. This would curb the growth in time complexity.
  2. It would be an interesting experiment to reverse the steps of the M-d and M-c methods: apply the fixed-size methods first and run the MST partitioning on the resulting clusters. The running time of their technique blows up because the MST is built over the whole dataset, as the authors themselves show; applying MST construction and edge cutting to the reduced dataset (the clusters) might improve efficiency.
  3. Instead of heuristics, it remains to be seen whether directly solving the optimization problem of minimizing SSE under the given set of constraints (written out below) yields better results.
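For point 3, under my reading of the paper's constraint (every group must contain at least k points), the program the heuristics approximate would be:

```latex
\min_{m,\; G_1, \dots, G_m} \;
\sum_{j=1}^{m} \sum_{x_i \in G_j} \lVert x_i - \bar{x}_j \rVert^2
\quad \text{subject to} \quad |G_j| \ge k \;\; \text{for all } j
```

where both the number of groups m and the partition itself are decision variables; the combinatorial search space is what makes heuristics attractive in the first place.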
Questions for discussion:

  1. It would be interesting to see whether existing clustering algorithms like k-means, agglomerative clustering, etc. could be adapted to include the fixed-minimum-number-of-points-per-cluster constraint. For example, in many implementations, k-means is restarted with different initial seeds until a specified minimum number of points falls within each group (see the sketch after this list).
  2. When there are inherent clusters in the dataset, it is hard to fix k, the minimum number of points per cluster, and in that case the information loss cannot be controlled either. Fixing the value becomes even harder when the clusters have soft boundaries; this is clear from the last few columns of Table 7.
  3. Table 6 clearly shows that this technique yields a large portion of over-sized clusters for the first class of datasets, sim_1 to sim_8, and the average size of those clusters is also large. This raises the question of how the technique can still achieve lower information loss (Table 5, last column).
  4. In Section 3.3, instead of iteratively applying the procedure classifypoint to every data point, another way to use the existing information would be to identify the roots of the trees in the forest via the descendant count values and assign points to groups during a traversal of each rooted tree (sketched after this list). Whether this alternative is more efficient remains to be seen.
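A toy version of the restart strategy from question 1, assuming scikit-learn's KMeans; the min_size handling is hypothetical and not from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_min_size(X, n_clusters, min_size, max_restarts=50, seed=0):
    """Restart k-means with fresh random seeds until every cluster
    holds at least min_size points (the crude adaptation question 1
    alludes to); returns None if no restart satisfies the constraint."""
    rng = np.random.default_rng(seed)
    for _ in range(max_restarts):
        km = KMeans(n_clusters=n_clusters, n_init=1,
                    random_state=int(rng.integers(2**31 - 1))).fit(X)
        sizes = np.bincount(km.labels_, minlength=n_clusters)
        if sizes.min() >= min_size:
            return km.labels_
    return None
```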

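And a sketch of the alternative raised in question 4: after edge cutting, label all points by walking each tree of the forest once, rather than classifying points one at a time. The adjacency-dict representation is an assumption here, and the roots could equally be picked out from the stored descendant counts rather than discovered during the scan.

```python
from collections import deque

def label_forest(adjacency, n):
    """Assign each of the n points a group id with one BFS per tree of
    the forest (adjacency: dict mapping node -> iterable of neighbors)."""
    labels = [None] * n
    group = 0
    for root in range(n):
        if labels[root] is not None:
            continue                      # already reached from an earlier root
        labels[root] = group
        queue = deque([root])
        while queue:                      # traverse one whole tree
            u = queue.popleft()
            for v in adjacency.get(u, ()):
                if labels[v] is None:
                    labels[v] = group
                    queue.append(v)
        group += 1
    return labels
```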