Saturday, February 18, 2012

Semisupervised Learning Using Negative Labels

Authors: Chenping Hou et al.

Journal: IEEE Transactions on Neural Networks, Volume: 22 Issue: 3


Summary:
The paper presents a graph-based semisupervised learning (SSL) method for classification in which unlabeled samples and samples with negative labels are used during the learning phase alongside samples with known class labels. A negative label records that a sample does not belong to a given class, and the method exploits this information to improve classification accuracy. The authors propose an iterative label propagation algorithm, NLP, which, after convergence, assigns class labels to the unlabeled and negatively labeled points.
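To make the mechanics concrete, below is a minimal sketch of an NLP-style iterative propagation. It assumes a generic update of the form F ← αPF + (1−α)AY, where P is a row-normalized affinity matrix, Y encodes the prior label information (one-hot for true-labeled points, spread over the non-excluded classes for negatively labeled points, zero for unlabeled points), and A weights how strongly each point's prior is trusted. The paper's equation 9 and its weighting matrix differ in detail, so this is an illustration rather than the authors' exact algorithm.

```python
import numpy as np

def rbf_affinity(X, sigma):
    # Pairwise RBF affinities W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def propagate_labels(X, Y, A, sigma=1.0, alpha=0.9, n_iter=200):
    """Hypothetical NLP-style propagation (a sketch, not the paper's Eq. 9).

    Y : (n, c) prior label matrix. Row i is one-hot for a true-labeled (TL) point,
        uniform over the non-excluded classes for a negative-labeled (NL) point,
        and all zeros for an unlabeled point.
    A : (n, n) diagonal matrix expressing how strongly each point's prior labels
        are trusted (large for TL, smaller for NL, zero for unlabeled).
    """
    W = rbf_affinity(X, sigma)
    P = W / W.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
    F = Y.copy()
    for _ in range(n_iter):
        # Blend neighbourhood evidence with the (weighted) prior labels.
        F = alpha * (P @ F) + (1 - alpha) * (A @ Y)
    return F.argmax(axis=1)                    # predicted class per point
```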

Strengths:
The method is novel in that it also uses unlabeled samples during the learning phase, whereas a traditional classifier trains only on labeled samples and finds the labels of test samples in a separate testing phase. The method integrates multiple negative labels for a given sample and, combined with a matrix formulation, yields a convenient model that admits a closed-form solution and a convergence analysis. The same formulation allows an out-of-sample extension, one of the key issues in the ML community. Tuning the parameter σ indirectly through τ keeps classification accuracy stable across varying values of τ (as shown in the experimental results), because σ depends on k and the geometry of the data points. NLP's significantly higher classification accuracy in all the experiments shows that learning over all points, irrespective of whether their class information is available, is important; the authors make the same observation when presenting their classification results.
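On the σ point, a plausible data-driven rule (a hypothetical sketch; the paper's τ-based rule may differ) is to derive σ from k and the local geometry by averaging each point's distance to its k-th nearest neighbour:

```python
import numpy as np

def sigma_from_k(X, k=7):
    """Derive the RBF width from k and the data geometry (a hedged sketch,
    not the paper's exact rule): average k-th nearest-neighbour distance."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)                 # exclude self-distances
    kth = np.sort(d, axis=1)[:, k - 1]          # k-th NN distance per point
    return kth.mean()
```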

Weaknesses:
  1. While the parameters A and Y are presented as aiding label propagation, a careful analysis of the algorithm shows that no real label propagation takes place: because of the way Y and A are set and used, NLP at its crux sees no difference between samples with negative labels and samples with no label.
    1. For example, consider the second point discussed under the label-update rule. When the data sample does not belong to the j-th class, the first term of equation 9, which drives label propagation, becomes negligible and the second term goes to zero, so effectively no propagation occurs. Toward the end of the same point, the authors describe label propagation when there is no prior information about the sample by treating it as an unlabeled sample.
    2. In other words, label propagation happens only when the data sample is unlabeled or when the NL sample carries too little prior information. The propagation term updates/approximates the labels of unlabeled samples through P and F, which are simply combinations of the weight matrix and the labels, and its values are dominated by the labeled samples in the neighborhood scaled by A; this makes NLP essentially an improved version of K-nearest neighbors (see the sketch after this list). The authors mention this equivalence with K-NN, albeit indirectly, when they discuss one-shot NLP.
  2. The effect of noise in the dataset is not discussed; noisy samples can be expected to complicate matters, since the algorithm has to reconcile several different types of points (TL, NL, and unlabeled).
  3. While eliminating σ as a free parameter is an innovative step, the algorithm is not made completely parameter-free: σ in turn depends on k, which remains a parameter of the algorithm. This shows that parameter selection and tuning is still an active area of research in ML.
  4. The section that reports accuracies for different NL selections makes clear that merely increasing the number of NL points does not guarantee higher classification accuracy; in fact, Tables XII and XIII show it can dip. The paper lacks an in-depth discussion of the balance between the numbers of TLs and NLs, which is crucial for understanding the algorithm's behavior across datasets. Figure 2 shows that beyond a certain number of NL points, the classification accuracy saturates.
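The reduction to a nearest-neighbour rule claimed in point 1.2 can be pictured as a one-shot, affinity-weighted K-NN vote over labeled neighbours only. This is an illustration of the argument above, not code from the paper:

```python
import numpy as np

def weighted_knn_predict(X_lab, y_lab, X_query, k=7, sigma=1.0):
    """One-shot, propagation-free classification: each query point takes the
    affinity-weighted vote of its k nearest *labeled* neighbours, which is what
    NLP is argued to collapse to when the propagation term carries no new
    information for NL points."""
    d2 = ((X_query[:, None, :] - X_lab[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]                       # k nearest labeled points
    w = np.exp(-np.take_along_axis(d2, idx, axis=1) / (2 * sigma ** 2))
    n_classes = y_lab.max() + 1
    votes = np.zeros((len(X_query), n_classes))
    for c in range(n_classes):
        votes[:, c] = (w * (y_lab[idx] == c)).sum(axis=1)     # weighted class votes
    return votes.argmax(axis=1)
```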
Next steps:
  1. The most important observation about this algorithm is its good classification accuracy compared with all the others. As mentioned above, every point contributes some information, and iterating over that information leads to better results. It would be worth checking whether this idea carries over to other supervised learning approaches, which train only on points with label information and therefore use only part of the sample space. As the authors also note, more points lead to better accuracy.
  2. Tables XVI and XVII show that NLP has the second-highest computational time. Since the labels of TL points are kept fixed throughout the algorithm, and there are applications where TL points form a significant portion of the dataset, the iterations over those points could be skipped: their labels are neither updated nor queried in any case. This could significantly improve the running time (see the sketch after this list).
  3. The relationship between the numbers of TLs and NLs can be analyzed further, not only to improve NLP; such an analysis could also bring forth other interesting observations.
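The speed-up suggested in point 2 above could look like the following sketch, assuming the same generic F ← αPF + (1−α)AY update used earlier: only the rows of F belonging to NL and unlabeled points are recomputed each iteration, while TL rows stay clamped.

```python
import numpy as np

def propagate_skipping_TL(P, Y, A, tl_mask, alpha=0.9, n_iter=200):
    """tl_mask : boolean array, True for true-labeled (TL) points, whose rows
    never change and therefore need not be recomputed in each iteration."""
    F = Y.copy()
    upd = ~tl_mask                                   # rows that actually evolve
    prior = (A @ Y)[upd]                             # fixed prior term, computed once
    for _ in range(n_iter):
        # Only NL/unlabeled rows are updated; TL rows stay clamped to Y.
        F[upd] = alpha * (P[upd] @ F) + (1 - alpha) * prior
    return F.argmax(axis=1)
```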

Questions for discussion:
  1. It would be interesting to check whether conventional one-vs-the-rest multi-category classification can be improved by incorporating additional samples with negative labels (which are not normally used in the training phase) as members of the 'rest' category (see the sketch after this list).
  2. The determination of σ through k and d̂ could be analyzed further for use in many kernel-based ML methods, including but not limited to SVMs and Kernel PCA, since these methods otherwise rely on brute-force cross-validation to set such parameter values.
  3. The authors state in several places that a TL point is equivalent to an NL point carrying enough negative labels. It would be interesting to revisit existing multi-category classification datasets to annotate the classes a sample does not belong to, rather than giving exactly one class label per sample, and to verify whether classification accuracy improves after suitably modifying the learner under consideration.
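For the first question, a minimal sketch of folding negative labels into a standard one-vs-the-rest scheme is given below. It is hypothetical (each NL sample is assumed to exclude a single class, and scikit-learn's LogisticRegression is used only as a stand-in base learner): an NL sample known not to belong to class c is added as an extra negative example when training the binary classifier for class c.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ovr_with_negative_labels(X_tl, y_tl, X_nl, nl_class, n_classes):
    """One binary classifier per class. TL points contribute positives and
    negatives as usual; an NL point known *not* to belong to class c is added
    as an extra negative when training the classifier for class c."""
    models = []
    for c in range(n_classes):
        X_c = np.vstack([X_tl, X_nl[nl_class == c]])
        y_c = np.concatenate([(y_tl == c).astype(int),
                              np.zeros((nl_class == c).sum(), dtype=int)])
        models.append(LogisticRegression(max_iter=1000).fit(X_c, y_c))
    return models
```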

