Authors: Mustafa Bilgic et al.
Conference: ICML 2010
Summary:
The paper presents an active learning algorithm, ALFNET, that exploits
explicit network structure in the data (in the form of the labels of
neighboring nodes) for collective classification of nodes, selecting
only informative examples for labeling so as to perform well on test
nodes whose labels are unavailable. The underlying assumption is that
the labels of linked nodes are correlated; by exploiting this
correlation, it is possible to outperform the traditional approach of
using only the nodes' own attributes.
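To make the selection step concrete, below is a minimal sketch of the cluster-and-disagree acquisition loop as described; the names (co_predict, cc_predict, select_for_labeling) are placeholders of mine, and the paper's actual cluster score is richer than plain CO/CC disagreement, so treat this as an approximation rather than the authors' algorithm.

    def disagreement(preds_a, preds_b):
        """Fraction of nodes on which two predictors differ: a crude proxy
        for the paper's disagreement-based cluster score."""
        return sum(a != b for a, b in zip(preds_a, preds_b)) / max(len(preds_a), 1)

    def select_for_labeling(clusters, co_predict, cc_predict, budget):
        """Rank clusters by how much the content-only (CO) and collective
        (CC) learners disagree on them, then take one yet-unlabeled node
        from each of the top-scoring clusters for the oracle to label."""
        scored = []
        for nodes in clusters:  # each cluster is a list of unlabeled node ids
            co = [co_predict(v) for v in nodes]  # predictions from node attributes only
            cc = [cc_predict(v) for v in nodes]  # predictions using neighbor-label features
            scored.append((disagreement(co, cc), nodes))
        scored.sort(key=lambda t: t[0], reverse=True)
        return [nodes[0] for _, nodes in scored[:budget]]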
Highlights/Strengths:
- Elegant combination of key machine learning concepts (active learning, semi-supervised learning, dimensionality reduction) over networked data.
- Active learning for collective classification is one of a kind: present-day network datasets run to terabytes, and obtaining label information for every node is impractical. This combination therefore reduces the cost of labeling significantly without compromising accuracy.
- The p-value plots and t-tests, albeit at the 0.1 significance level, are quite informative for the comparative studies.
- Dimensionality reduction of the sparse binary feature vectors to obtain better accuracy (one standard realization is sketched just after this list).
- Clustering is used as an initial step to logically partition the nodes of the network, yielding a balanced initial training set and guiding label acquisition thereafter. Intuitively, the assumed independence of a node's label from the attributes of non-neighboring nodes works because of this step; however, using the majority class within a cluster indirectly brings in a contribution from non-neighbors as well.
- Because of the disagreement measure, ALFNET gives more importance to uncertain regions of the learning space.
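One standard way to realize the dimensionality-reduction step mentioned above, sketched under the assumption of scikit-learn's truncated SVD; this is not the paper's exact procedure, just an illustration on toy data shaped like Cora's 1433 binary word features.

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.decomposition import TruncatedSVD

    rng = np.random.default_rng(0)
    # toy stand-in for Cora-style sparse binary bag-of-words vectors
    X = csr_matrix(rng.binomial(1, 0.05, size=(500, 1433)).astype(float))
    svd = TruncatedSVD(n_components=100, random_state=0)  # 1433 sparse dims -> 100 dense dims
    X_reduced = svd.fit_transform(X)
    print(X_reduced.shape, svd.explained_variance_ratio_.sum())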
Weaknesses:
- The algorithm iterates over all nodes of the graph, approximating collective classification with a local collective classification model; since network datasets are usually large, the running time and convergence behavior of the algorithm deserve study, but no such analysis is given.
- The experiments are not extensive enough to support any conclusive observation. Moreover, the network datasets are not large in their original form, and the authors' preprocessing, which retains only connected components, shrinks them further; results on such restricted datasets cannot be taken as generally applicable.
- Consider their modeling assumption: if the labels of the neighboring nodes are known, then the label of the node under consideration is independent of the attributes of both neighbors and non-neighbors. Since semi-supervised learning is used to predict the labels of unobserved neighbors, the question arises: what if the predicted labels are wrong, and if so, how does the error propagate from iteration to iteration? (See the sketch after this list.)
- It would have been informative to list the actual classes present in the Cora and Citeseer datasets. Given that papers are published across a vast number of domains, it is not clear how such a small number of classes was arrived at.
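To make the error-propagation question concrete, here is a minimal ICA-style inference loop under simplifying assumptions of mine (the predict callable and the dict-of-lists graph are illustrative, not the authors' code). A wrong neighbor label enters the feature set of every adjacent node on the next sweep, which is exactly the feedback path in question.

    def ica(adj, features, known, predict, rounds=10):
        """Iterative classification: repeatedly re-predict each unknown
        label from the node's own features plus its neighbors' current
        labels, until nothing changes or the round budget runs out."""
        labels = dict(known)  # start from the oracle-provided labels only
        for _ in range(rounds):
            changed = 0
            for v in adj:
                if v in known:
                    continue  # never overwrite labels acquired from the oracle
                neigh = sorted(labels[u] for u in adj[v] if u in labels)
                new = predict(features[v], neigh)  # errors in `neigh` feed forward here
                changed += labels.get(v) != new
                labels[v] = new
            if changed == 0:  # fixed point: the iterates have settled
                break
        return labels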
Next steps / Questions for discussion:
- It would be interesting to study the propagation of errors (wrong class labels assigned by the semi-supervised learner) to see whether the iterates settle or converge regardless of the starting labels.
- Extending this work to directed graphs is not straightforward, because the dependencies among nodes (and their labels) change when the direction of the links changes. Also, while the notion of a label or class is well defined in the scientific-publication domain, it is not obvious in social-network or biological domains; if a general approach to labeling could be obtained, it could be used to find communities in graph datasets.
- While it appears on the surface that the network structure is used, the classifier CC, a key player in ALFNET, is fed only an aggregated measure of the neighborhood in the form of additional attributes (see the sketch after this list). It would be interesting to see how ALFNET performs if a true relational learner is used in place of CC, since the reported accuracy results, even at 90% confidence, are not close to 0.8.
- Since collective classification is about finding labels for all nodes, including outliers and noisy samples, sparsity should not be treated as missing information, as the authors do; it would be worth seeing how ALFNET behaves under that change.
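As a footnote to the point about CC seeing the neighborhood only through aggregates, this is the kind of proportion feature meant, with illustrative names (the paper's exact aggregation operator may differ):

    from collections import Counter

    def neighbor_label_features(node, adj, labels, classes):
        """Proportion of the node's currently labeled neighbors in each
        class; these values are appended to the node's own attribute
        vector before it is handed to CC."""
        counts = Counter(labels[u] for u in adj[node] if u in labels)
        total = sum(counts.values()) or 1  # avoid dividing by zero at isolated nodes
        return [counts.get(c, 0) / total for c in classes]

    # e.g. neighbor_label_features(1, {1: [2, 3]}, {2: "AI", 3: "DB"},
    #                              ("AI", "DB", "ML"))  ->  [0.5, 0.5, 0.0]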