Musings and trials of a PhD tyro: 2011

Friday, December 30, 2011

Active Learning for Networked Data

Authors: Mustafa Bilgic et al

Conference: ICML 2010

Summary:

The paper presents an active learning algorithm called ALFNET that exploits explicit network structure in the data (in the form of the labels of neighboring nodes) for collective classification of nodes. It requests labels only for informative examples, so that it can still perform well on test nodes whose labels are not available. The underlying assumption is that the labels of linked nodes are correlated; exploiting this correlation makes it possible to outperform the traditional approach of using only the attributes of the nodes.

Highlights/Strengths:

Elegant combination of key machine learning concepts such as active learning, semi-supervised learning, and dimensionality reduction over networked data.
Applying active learning to collective classification is one of a kind, because present-day network data run to terabytes and obtaining label information for all nodes is impractical. The combination thus helps reduce the cost of labeling significantly without compromising accuracy.
The p-value plots and t-tests, albeit at the 0.1 significance level, are quite informative for comparative studies.
Dimensionality reduction of sparse binary feature vectors to get better accuracy.
Use of clustering as the initial step to logically separate the nodes in the network, so as to obtain a balanced training set and to guide label acquisition from there on. Intuitively, the assumed independence of a node's label from the attributes of non-neighbor nodes appears to work because of this step. However, using the majority class in a cluster indirectly brings in a contribution from non-neighbors too.
Because of the use of a disagreement measure, ALFNET gives more importance to uncertain regions of the learning space.

Weaknesses:

Since the algorithm iterates over all the nodes of the graph, approximating collective classification by a local collective classification model, and since network data are usually large, it is important to study the running-time behavior of the algorithm and its convergence, but no such analysis is done.
The experiments are not extensive enough to support any conclusive observation. Also, the network datasets are not large in their original form, and the authors preprocess them to retain only connected components, which further reduces the datasets' size. Results on such restricted datasets cannot be seen as generally applicable.
Consider their modeling assumption: if the labels of the neighboring nodes are known, then the label of the node under consideration is independent of the attributes of neighbors and non-neighbors. Since they use semi-supervised learning to predict the labels of unobserved neighbors, the question arises of what happens if the predicted labels are wrong and, if so, how the error propagates from iteration to iteration.
It would have been informative to know the actual classes present in the Cora and Citeseer datasets. Given that papers are published in a vast number of domains, it is not clear how such a small number of classes was found in these datasets.

Next steps / Questions for discussion:

It would be interesting to study how errors propagate (wrong class labels from semi-supervised learning) to see whether the iterates settle or converge regardless of the starting labels.
Extending this work to a directed graph is not straightforward, because in that case the dependencies between nodes (and their labels) change when the direction of the links changes. While the notion of a label or class is well defined in the scientific domain, it is not obvious in the social network or biological domains. If a general approach for labeling can be obtained, it can be used to find communities in graph datasets.
While it appears on the surface that the network structure is used, the classifier CC, a key player in ALFNET, is fed only an aggregated measure of the neighborhood in the form of additional attributes. It would be interesting to see how ALFNET performs if a true relational learner is used in place of CC, since the reported accuracy results, even with 90% confidence, are not close to 0.8.
Since collective classification is all about finding the label for all nodes, including outliers and noisy samples, there is a need not to treat sparsity as missing information, as these authors have done, and to see how ALFNET behaves then.

Rogue spyware XP Home Security 2012 removal

I ran into the XP Security 2012 spyware while browsing for an old Tamil movie and, thanks to Bleeping Computer, could successfully fix it. Though there are only a few steps involved, the whole process took hours, so you may want to start a step and get on with other work to save time.

What's great about this solution is that the fix uses free software. The only issue is that you need a second, clean computer to download the files, take them to the infected one, and run them there.

Here is what I did:
(btw, I had a second clean computer)

1) Try to mute the pop-ups from the spyware by doing a fake registration with the activation codes available from pcrisk.com. If you are not able to open web browsers, you have to somehow download the registry fix software; pcrisk.com provides this fix. Try to download it through Start->Run by typing www.pcrisk.com/xp-fix. Fixing the registry entries is the important first step.

2) When the internet itself does not work, you will have to start with the manual removal instructions, which are available from many websites; be sure to boot the system in 'Safe Mode with Networking' for this. This is a tedious process, but not as tedious as formatting and installing all the software again. For me, manual removal was like a preprocessing step, since even following those instructions verbatim didn't solve the problem. Basically you have to somehow fool the rogue software into believing that you are going to register and are not doing anything to remove it.

3) pcrisk.com talks about a proxy fix too. I didn't have this problem, so I didn't have to worry about it.

4) General tips for XP Security 2012 removal are available here

5) Once the internet connection is up, follow the steps from Bleeping Computer, as I did

6) Do not forget to download and install MBAM from cnet.com. Perform a full scan using MBAM and remove the rogue spyware.

That's it.

Monday, December 26, 2011

vi tips

1) To convert a DOS text file to a Unix file, in the escape mode of the vi editor, type:

:set ff=unix

This removes the additional carriage return "\r" at the end of each line once the file is saved.
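The same conversion can be done outside the editor. This is a minimal sketch using tr, not the vi method above; the file names sample_dos.txt and sample_unix.txt are made up for the illustration.

```shell
# Build a small DOS-style sample file (the file names here are hypothetical).
tmpdir=$(mktemp -d)
printf 'first line\r\nsecond line\r\n' > "$tmpdir/sample_dos.txt"

# Delete every carriage return, which leaves Unix line endings.
tr -d '\r' < "$tmpdir/sample_dos.txt" > "$tmpdir/sample_unix.txt"

# Count the carriage returns that remain in the converted file.
cr_left=$(tr -cd '\r' < "$tmpdir/sample_unix.txt" | wc -c | tr -d ' ')
echo "carriage returns left: $cr_left"
```

Note that tr removes every carriage return in the file, not only the ones at line ends, which is usually fine for plain text.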

2) To search and replace, in the escape mode:
:.,$s/pat1/pat2/g

This replaces every occurrence of pat1 with pat2 from the current line to the end of the file. The current line is indicated by the . after the :, and the end of the file by the $ sign after the comma.
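sed accepts a similar line range, which may help to see what the address part does. A small sketch with made-up input; since sed has no notion of a current line, line 2 stands in for the '.' here.

```shell
# Replace pat1 with pat2 from line 2 to the last line ($); line 1 is left alone.
# Single quotes keep the shell from treating $s as a variable.
range_out=$(printf 'pat1 a\npat1 b\npat1 pat1 c\n' | sed '2,$s/pat1/pat2/g')
echo "$range_out"
```

Only the second and third lines change, and the g flag replaces both occurrences on the third line.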

3) In command mode, type '+' to move to the beginning of the next line and '-' to move to the beginning of the previous line.

4) To remove space from the beginning of all lines in vi editor:

:%s/^\s\+//

Here \s matches a space or a tab. No g flag is needed, because ^ can match only once per line.

5) To replace a bunch of characters with a newline in the vi editor, use \r in the replacement part.
For example, (1,2,3), ('a','b','c'), ('.....) : this list of tuples all printed in one line can be split as
(1,2,3)
('a','b','c')
....
by typing in escape mode in vi editor:

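:%s/), /)\r/g

Note the space following the ), characters in the pattern, and that the replacement puts the ) back before the newline so that each tuple keeps its closing parenthesis. Note also that there is no need to escape \ in \r.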
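For comparison, here is a sketch of the same split done from the command line with awk instead of vi. The third tuple is invented for the example.

```shell
# One line holding several tuples, separated by a comma and a space.
tuples="(1,2,3), ('a','b','c'), ('x','y')"

# Replace each "), " with ")" plus a newline, so every tuple keeps its closing parenthesis.
split_out=$(printf '%s\n' "$tuples" | awk '{ gsub(/\), /, ")\n"); print }')
echo "$split_out"
```

Inside awk the newline in the replacement is written as \n, whereas vi wants \r there.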

6) So, during the initial days of my cooking, I used to just copy and paste recipes from the web into my editor. This editor was Notepad during my W7 days, and when I installed Mint 16, I naturally switched to vi. Because of this transition, I started seeing a lot of Windows-specific metacharacters in the recipe file. They look like this: <96> or <97>. You cannot search for them by typing those characters, because each one is a single byte and not a sequence of separate characters. For this, I found the solution on SuperUser:

:%s/[\x96]//g

This syntax deletes all the occurrences. You can add additional characters as follows:

:%s/[\x96\x97]//g

It is important to have \x in the syntax to indicate that you are searching for special characters by their hex code. If you know the exact equivalent character, you may add it in the 'replace' part of the command.
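The same bytes can also be stripped without opening the file. A rough sketch with tr, assuming the stray characters really are the single bytes 0x96 and 0x97 (octal 226 and 227); recipe.txt and its contents are invented for the example.

```shell
# Create a sample file containing the raw bytes 0x96 and 0x97.
tmpdir=$(mktemp -d)
printf 'salt \226 1 tsp\nbake 20 \227 25 min\n' > "$tmpdir/recipe.txt"

# LC_ALL=C makes tr work on single bytes; delete both bytes everywhere.
LC_ALL=C tr -d '\226\227' < "$tmpdir/recipe.txt" > "$tmpdir/recipe_clean.txt"

# Count how many of those bytes are left in the cleaned copy.
dirty_left=$(LC_ALL=C tr -cd '\226\227' < "$tmpdir/recipe_clean.txt" | wc -c | tr -d ' ')
echo "stray bytes left: $dirty_left"
```

tr takes octal escapes, so 0x96 is written as \226 here instead of \x96.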

Sunday, December 25, 2011

Linux Util 1

1) To paste one of the columns of a multi-column file as one other column in the same file:
cut -f3 -d' ' fname1|paste -d' ' - fname1 > fname2

This command takes the 3rd column of file fname1, pastes it as the first column, and writes the result to file fname2. Rename fname2 to fname1 to avoid keeping duplicate copies. Moving the hyphen in the paste command to the end makes the 3rd column the last column instead of the first. Here the column separator is the space character given after the '-d' switch.
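A small worked example of both placements of the hyphen, run on a made-up two-line file in a temporary directory:

```shell
# Sample input with three space-separated columns.
tmpdir=$(mktemp -d)
printf 'a b c\nd e f\n' > "$tmpdir/fname1"

# Hyphen first: the 3rd column becomes the new first column.
cut -f3 -d' ' "$tmpdir/fname1" | paste -d' ' - "$tmpdir/fname1" > "$tmpdir/first.txt"

# Hyphen last: the 3rd column is appended as the new last column.
cut -f3 -d' ' "$tmpdir/fname1" | paste -d' ' "$tmpdir/fname1" - > "$tmpdir/last.txt"

cat "$tmpdir/first.txt" "$tmpdir/last.txt"
```

The first file holds "c a b c" and "f d e f"; the second holds "a b c c" and "d e f f".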

2) I keep forgetting to leave a space on either side of the binary operator in the expr command in shell scripting.
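A quick reminder of what the spaces change, as a small sketch:

```shell
# With spaces, expr sees three arguments and does the arithmetic.
sum=$(expr 2 + 3)

# Without spaces, expr sees the single string "2+3" and prints it back unchanged.
unspaced=$(expr 2+3)

# The multiplication sign must also be escaped so the shell does not expand it.
product=$(expr 4 \* 5)

echo "$sum $unspaced $product"
```

This prints "5 2+3 20".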

3) I keep forgetting the rules for using double quotes and single quotes in shell scripting. If you put a single-quoted word inside a double-quoted sentence, the single quotes are kept as literal characters, but a $variable is still substituted with its value.
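A short sketch of the two cases, using a made-up variable called name:

```shell
name="world"

# Inside double quotes the single quotes are ordinary characters and $name is expanded.
double="Hello '$name'"

# Inside single quotes nothing is expanded, so $name stays as typed.
single='Hello $name'

echo "$double"
echo "$single"
```

The first echo prints Hello 'world' and the second prints Hello $name.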

4) To find all C/C++ program files and related header files:

find . -regex '.*\.\(c\|h\|cxx\|cpp\)$' -print

This command finds files with the extensions .c, .h, .cxx, and .cpp and prints their paths relative to the current directory. Please note the extra characters ".*\." at the front; they are needed because -regex matches against the whole path. The second dot is escaped, and so are the parentheses and the logical OR symbol '|'.

5) A modification of the above command is to replace the final '-print' option with '-print0', that is, to append the digit zero to the word print. This separates the file names with a null character instead of a newline, so the output appears on one line instead of one file name per line.

6) If you want the contents of all these files in one go, give the following:

find . -regex '.*\.\(c\|h\|cxx\|cpp\)$' -print0|xargs -0 cat
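If the find on your system does not accept this regex syntax, a variant with -name tests gives the same list. This is an alternative sketch, not the -regex form above, and the directory layout is invented for the example.

```shell
# Build a tiny source tree, including a directory name with a space in it.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/src/my lib"
printf 'int main(void) { return 0; }\n' > "$tmpdir/src/main.c"
printf '#define ANSWER 42\n' > "$tmpdir/src/my lib/util.h"
printf 'not source code\n' > "$tmpdir/src/notes.txt"

# List the C/C++ sources and headers, one path per line.
found=$(cd "$tmpdir" && find . \( -name '*.c' -o -name '*.h' -o -name '*.cxx' -o -name '*.cpp' \) -print | LC_ALL=C sort)
echo "$found"

# Null-separated names survive the space in "my lib" when handed to xargs -0.
contents=$(cd "$tmpdir" && find . \( -name '*.c' -o -name '*.h' -o -name '*.cxx' -o -name '*.cpp' \) -print0 | xargs -0 cat)
echo "$contents"
```

The -print0 and -0 pair is what keeps a path such as "./src/my lib/util.h" from being split into two arguments.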

7) To remove all spaces in the filenames of a directory, get inside the directory and type in command line:

rename 's/ //g' *

Here, the single space after the first '/' is important. Replace '*' with other names if you do not want to apply this change to all files.
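On a machine where this rename command is missing or behaves differently, a plain loop does the same job. A sketch under that assumption, run on made-up file names in a temporary directory:

```shell
# Create a few sample files, two of them with spaces in their names.
tmpdir=$(mktemp -d)
touch "$tmpdir/my file.txt" "$tmpdir/another long name.c" "$tmpdir/nospace.h"

for f in "$tmpdir"/*; do
  base=$(basename "$f")
  # Strip every space from the file name only, not from the directory part.
  new=$(printf '%s' "$base" | tr -d ' ')
  # Rename only when the name actually changes.
  if [ "$base" != "$new" ]; then
    mv -- "$f" "$tmpdir/$new"
  fi
done

ls "$tmpdir"
```

The loop does not check whether the new name already exists, so two files that differ only in spaces would overwrite one another.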