Saturday, February 18, 2012

Bayesian Multi-Task Reinforcement Learning

Authors: Alessandro Lazaric, Mohammad Ghavamzadeh 

Conference: ICML, 2010


Summary:
The paper addresses multi-task reinforcement learning (RL) in settings where the number of samples available for any single task is limited because of the policy involved. The work assumes that the tasks share a similar structure and hence that the corresponding value functions (vfs) are sampled from a common prior. This assumption lets the authors learn the vfs jointly, whether or not they come from the same task class. The paper stands out from others in its use of a hierarchical Bayesian approach to model the distribution over vfs in both parametric and non-parametric settings.
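The shared-prior assumption can be sketched in a few lines. The following is a minimal generative sketch, assuming linear value functions whose task-specific weights are drawn from one common Gaussian prior; all names and dimensions are illustrative, not the paper's notation or model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared prior over value-function weights
# (linear vf: V(s) = phi(s) @ w).
d = 3                              # number of features (illustrative)
mu0 = np.zeros(d)                  # shared prior mean
Sigma0 = np.eye(d)                 # shared prior covariance

# Each task draws its own weight vector from the common prior.
n_tasks = 5
W = rng.multivariate_normal(mu0, Sigma0, size=n_tasks)   # (n_tasks, d)

# Joint learning intuition: pooling the tasks gives an empirical
# estimate of the shared prior, which then serves as an informed
# prior for a new task that has only a few samples.
mu_hat = W.mean(axis=0)
Sigma_hat = np.cov(W, rowvar=False)
```

The point of the hierarchy is in the last two lines: the estimate built from the observed tasks is what gets transferred to a newly observed task.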
Strengths:
  1. Generative models and inference algorithms are provided for both learning settings considered (symmetric and asymmetric).
  2. Value-function similarity is modeled through a hierarchical Bayesian model (HBM).
  3. Different modes of learning are supported: symmetric parametric and asymmetric non-parametric.
  4. Almost all the key machine learning areas, such as regression, sampling, Bayesian modeling, expectation maximization, and Dirichlet processes, are touched upon, making this a paper with sound theoretical arguments.
  5. Information from the joint distribution of vfs is transferred to learn the value function of a new task.
Weaknesses:
  1. The authors compare three paradigms (STL, MCMTL, and SCMTL) but fail to compare, on the benchmark problems, how the other related techniques perform, even though they cite a considerable number of related works. Further, since they have significantly adapted ideas from the literature, they could have compared BMTL against already published results.
  2. The sampling techniques employed for the asymmetric settings are computationally expensive; a discussion of time complexity would have helped.
  3. The paper reads as an amalgamation of already established techniques combined in a new way, and hence refers frequently to earlier papers for all the important parameters and results, which makes it hard to read. In that sense, the paper is not self-contained.
  4. There is no clear experimental setup to corroborate the claimed ability to handle an unspecified number of classes.
  5. It is surprising that performance dips in all cases when the number of samples increases. While it is good to see the methods do well with limited samples and a growing number of tasks, the proposed method should be improved to exploit large numbers of samples when they are available.
Next steps/Discussion:
  1. Referring to figure 5c, a discussion of why MCMTL fails when the number of tasks is small would be welcome.
  2. Some form of transfer learning is clearly happening while learning the value function of a newly observed task. It would be interesting to analyze under which paradigm of transfer learning this work falls.
  3. It would be useful to know the types of features usually used to represent vfs in RL, especially for benchmark problems such as the inverted pendulum.
  4. Since RL is predominantly used in robotics, it would be good to see a real-world example where the vfs come from the same prior.
  5. How does a simple Gaussian process differ from GPTD?

Minimum spanning tree partitioning algorithm for microaggregation

Authors: Michael Laszlo, Sumitra Mukherjee

Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 17, Issue: 7
Issue date: July 2005



Summary:
The paper discusses a heuristic for clustering a set of points, under a constraint on the minimum number of points per cluster, by partitioning the minimum spanning tree (MST) of the underlying complete graph so that information loss is minimal. The algorithm indirectly minimizes the within-group sum of squared errors (SSE), a common clustering objective, through three steps: MST construction, edge cutting, and cluster formation.
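The three steps can be sketched as follows. This is an illustrative greedy sketch under the stated minimum-size constraint, not the authors' exact edge-cutting procedure; the function name `mst_partition` and the cut-then-undo-if-infeasible rule are assumptions of this sketch.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_partition(X, k):
    """Greedy sketch of MST-based clustering with minimum cluster size k."""
    n = len(X)
    # Step 1: MST construction over the complete distance graph.
    D = squareform(pdist(X))
    T = minimum_spanning_tree(D).toarray()
    edges = [(T[i, j], i, j) for i in range(n) for j in range(n) if T[i, j] > 0]
    edges.sort(reverse=True)                 # heaviest edges first
    adj = {i: set() for i in range(n)}
    for _, i, j in edges:
        adj[i].add(j); adj[j].add(i)

    def component(start):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v); stack.append(v)
        return seen

    # Step 2: edge cutting -- remove a heavy edge only if both
    # resulting components keep at least k points.
    for _, i, j in edges:
        adj[i].discard(j); adj[j].discard(i)
        if len(component(i)) < k or len(component(j)) < k:
            adj[i].add(j); adj[j].add(i)     # undo infeasible cut

    # Step 3: cluster formation -- connected components of what remains.
    labels, seen, c = -np.ones(n, dtype=int), set(), 0
    for s in range(n):
        if s not in seen:
            comp = component(s)
            seen |= comp
            labels[list(comp)] = c
            c += 1
    return labels
```

On two well-separated blobs with k equal to the blob size, the heaviest MST edge is the bridge between the blobs, which is the only cut that keeps both sides feasible, so exactly two clusters result.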
Strengths:
  1. The paper gives a time-complexity analysis for each algorithm, which helps convey the scale of the problem.
  2. The application to microaggregation introduces another problem to the ML community: instead of specifying the number of clusters, the constraint is on the number of points per cluster, which indirectly controls the number of clusters.
  3. The heuristics integrate straightforwardly with other methods, and all the subroutines rely on established algorithms (e.g., MST) and data structures (e.g., priority queues).
Weaknesses:
  1. While the algorithm offers a new flavor of tree/graph partitioning, it compromises severely on time complexity by introducing the MST construction step. The last two steps are present in some form in most clustering algorithms. The authors argue that their approach performs well when the points have well-defined inherent clusters, but fail to see that an iterative, simple-to-implement k-means with appropriate modifications could do the same.
  2. Some of the reported running times are quadratic in the number of data points, which clearly makes this algorithm unsuitable for large real-world datasets.
  3. Performance (information loss) is good only for datasets with well-separated clusters, but in practice such well-defined clusters rarely occur, and, as the authors point out, simple algorithms such as D or C achieve good minimum information loss for arbitrarily shaped clouds of points.
  4. The SSE criterion is best suited to Gaussian-type clouds of points; for other types of clustered points, the authors fail to discuss alternative objective criteria. It is not clear what types of clusters are found in datasets such as Creta discussed in the paper.
  5. While Euclidean distance is the most common measure, other distance measures might be better suited to the microaggregation problem.
  6. As the authors themselves point out, there is no upper bound on the group size, and they are forced to resort to a combination of fixed-size methods with MST partitioning to control it.
Next steps:
  1. When the dataset has clearly separated clusters, a simple clustering algorithm can be used to capture them, and the techniques discussed in the paper can then be applied as a follow-up step to control k. This would curb the growth in time complexity.
  2. It would be an interesting experiment to reverse the steps of the M-d and M-c methods: apply the fixed-size methods first and then run MST on the resultant clusters. Their technique's running time shoots up because MST is run on the whole dataset, as the authors themselves show; applying MST and edge cutting on the reduced dataset (clusters) might improve efficiency.
  3. Instead of heuristics, it needs to be seen whether directly solving the optimization problem of minimizing SSE under the given constraints yields better results.
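The direct optimization in the last point can be written down explicitly. Under the usual microaggregation setup, with groups C_1, ..., C_m of minimum size k, it is (the 2k-1 upper bound is the one commonly assumed in the microaggregation literature, not part of this paper's method, which has no upper bound):

```latex
\min_{C_1,\dots,C_m}\; \sum_{j=1}^{m} \sum_{x_i \in C_j} \left\lVert x_i - \bar{x}_j \right\rVert^2
\quad \text{s.t.} \quad k \le |C_j| \le 2k-1 \;\; \forall j,
\qquad \bar{x}_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .
```

This is a combinatorial problem, which is why the paper resorts to MST-based heuristics in the first place.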
Questions for discussion:

  1. It would be interesting to see whether existing clustering algorithms such as k-means or agglomerative clustering could be adapted to enforce a fixed minimum number of points per cluster. For example, in many implementations, k-means is restarted with different initial seeds multiple times until a specified minimum number of points falls within each group.
  2. When there are inherent clusters in the dataset, it is hard to fix k, the minimum number of points per cluster, and in that case the information loss cannot be controlled either. Fixing this value becomes harder still when the clusters have soft boundaries, as the last few columns of table 7 make clear.
  3. Table 6 clearly shows that this technique produces a large portion of over-sized clusters for the first class of datasets (sim_1 to sim_8), and that the average size of those clusters is large. This raises the question of how the technique can still achieve lower information loss (table 5, last column).
  4. In section 3.3, instead of iteratively applying the procedure classifypoint to every data point, another way to use the existing information would be to use the descendant-count value to identify the roots of the trees in the forest and assign points to groups while traversing each rooted tree. Whether this alternative is more efficient remains to be seen.
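The restart idea in the first discussion point above can be sketched as follows. This is an illustrative sketch; the function `kmeans_min_size` and its retry policy are assumptions, not taken from the paper.

```python
import numpy as np

def kmeans_min_size(X, n_clusters, min_size, n_restarts=20, n_iter=50, seed=0):
    """Restart plain Lloyd's k-means with different seeds until every
    cluster holds at least min_size points (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_restarts):
        # Seed centers with randomly chosen distinct data points.
        centers = X[rng.choice(len(X), n_clusters, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(1)
            new = np.array([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(n_clusters)])
            if np.allclose(new, centers):
                break
            centers = new
        # Accept this run only if the size constraint holds.
        sizes = np.bincount(labels, minlength=n_clusters)
        if sizes.min() >= min_size:
            return labels, centers
    return None  # no seed produced a feasible partition
```

Unlike the MST approach, this only rejects infeasible solutions after the fact rather than enforcing the constraint during clustering, so it may fail on awkward data, which is exactly the trade-off the discussion point raises.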

Semisupervised Learning Using Negative Labels

Authors: Chenping Hou et al.

Journal: IEEE Transactions on Neural Networks, Volume: 22 Issue: 3


Summary:
The paper presents a special type of graph-based semisupervised learning (SSL) for classification in which, in addition to samples with known class labels, samples with no labels and samples with negative labels are used during the learning phase. A negative label carries the information that a data sample does not belong to a given class, and the method exploits such samples to improve classification accuracy. The paper provides an iterative label-propagation algorithm, NLP, which, after convergence, yields the class labels of the points that had no label or a negative label.
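The paper's exact NLP update is not reproduced here, but the flavor of iterative graph-based label propagation it builds on can be sketched with the classic F <- alpha*S*F + (1-alpha)*Y iteration (the local-and-global-consistency style of Zhou et al.); NLP generalizes this kind of update by encoding negative-label information through its A and Y matrices. All names below are illustrative.

```python
import numpy as np

def propagate(W, Y, alpha=0.9, n_iter=200):
    """Generic graph label propagation, shown as a stand-in for the
    paper's NLP update. W is a symmetric affinity matrix; Y is the
    initial label matrix (one-hot rows for labeled points, zero rows
    for unlabeled points)."""
    d = W.sum(1)
    S = W / np.sqrt(np.outer(d, d))   # symmetric normalization D^-1/2 W D^-1/2
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        # Blend neighborhood evidence with the initial labels.
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(1)
```

On a four-node chain with the two endpoints labeled with different classes, the two interior nodes each inherit the label of their nearer endpoint, which is the basic propagation behavior the paper's algorithm refines.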

Strengths:
The method is novel in that it also uses unlabeled samples during the learning phase, whereas a traditional classifier uses only the labeled samples and relies on a separate testing phase to find the labels of test samples. The proposed method intelligently integrates multiple negative labels for a given sample, which, combined with matrix formulations, yields a convenient model, a closed-form solution, and a convergence analysis. The formulation also allows an out-of-sample extension, one of the key issues in the ML community. Tuning the parameter σ indirectly through τ keeps classification accuracy stable over varying values of τ (as shown in the experimental results), because σ depends on k and on the geometry of the data points. The significantly higher classification accuracy of NLP in all experiments shows that learning over all points, irrespective of whether their class information is available, is important; the authors also discuss this while presenting the classification results.

Weaknesses:
  1. While the parameters A and Y are shown to be helpful for label propagation, careful analysis of the algorithm shows that there is essentially no label propagation at all: NLP, at its crux, sees no difference between samples with negative labels and samples with no label, because of the way it operates and sets Y and A.
    1. For example, consider the second point discussed under the label-update rule. When a data sample does not belong to the j-th class, the first term in equation 9, which drives label propagation, becomes negligible and the second term goes to zero, which effectively means no label propagation. Toward the end of the same point, the authors explain label propagation for a sample with no prior information by treating it as an unlabeled sample.
    2. In other words, label propagation happens only when the data sample is unlabeled or when there is not enough prior information from the NL sample. The propagation term updates/approximates the labels of unlabeled samples through P and F, which are simply combinations of the weight matrix and the labels, with values dominated by the labeled samples in the neighborhood scaled by A; this makes NLP an improved version of k-nearest neighbors. The authors themselves mention this equivalence with k-NN, albeit indirectly, when discussing one-shot NLP.
  2. The effect of noise in the dataset is not discussed; noisy samples can be expected to complicate matters because of the coexistence of different types of points.
  3. While eliminating σ was an innovative step, the algorithm is not completely parameter-free, since σ in turn depends on k, which is itself a parameter of the algorithm. This shows that parameters and their tuning remain an active area of research in ML.
  4. The section discussing accuracies under different NL selections makes clear that merely increasing the number of NL points does not guarantee higher classification accuracy; in fact, tables XII and XIII show that it dips. The paper lacks an in-depth discussion of the balance between the numbers of TLs and NLs, which is crucial for understanding the algorithm's behavior across datasets. Figure 2 shows that beyond a certain number of NL points, classification accuracy saturates.
Next steps:
  1. The most important observation is the algorithm's good classification accuracy compared to all the others. As mentioned previously, each point contributes some information, and iterating over that information leads to better results. This approach should be checked for applicability in other supervised learning settings, where only the labeled points are used for training and hence only part of the sample space is exploited. As the authors themselves mention, more points lead to better accuracy.
  2. Tables XVI and XVII show that NLP has the second-highest computational time. Since the labels of TL points are kept unchanged throughout the algorithm, and there are applications where TL points form a significant portion of the dataset, the iterations over those points could be avoided: their labels are neither updated nor queried. This could significantly improve running time.
  3. The relationship between the numbers of TLs and NLs deserves further analysis, not just for improving NLP; such analysis could surface other interesting observations.

Questions for discussion:
  1. It would be interesting to check whether conventional one-vs-the-rest multi-category classification can be improved by incorporating additional samples with negative labels (which normally do not enter the training phase) as members of the 'rest' category.
  2. The determination of σ through k and d-cap could be analyzed further for use in kernel-based ML methods, including but not limited to SVMs and kernel PCA, since these methods currently rely on brute-force cross-validation to determine parameter values.
  3. At multiple points, the authors discuss the equivalence between TL points and NL points when there are many NLs. It would be interesting to take existing multi-category classification datasets, annotate the classes each sample does not belong to rather than giving exactly one class label per sample, and verify whether classification accuracy improves under an appropriately modified learner.