Thursday, July 5, 2012

More Python

Well, in less than one day, I came to know about two important Python-based software systems:

1) scikit-learn : thanks to some surfing, profile creation and eventually posting a question on Stack Overflow, I got to know this nice tool while looking hard for better Python-based ML tools (a small example follows this list).
2) Cython came as a surprise when I was reading about the efficiency of graph algorithms in NetworkX. So there is a way out if my experiments end up taking a lot of time, which I expect will happen. Hopefully I will have time to fiddle with Cython.
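To give a flavour of why scikit-learn (point 1 above) caught my attention, here is a minimal sketch of training a classifier in a handful of lines. It is a toy example of my own on a built-in dataset, nothing to do with my actual experiments:

from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()          # small built-in toy dataset
clf = SVC()                          # support vector classifier, default settings
clf.fit(iris.data, iris.target)      # train on the whole dataset (demo only)
print(clf.predict(iris.data[:5]))    # predict labels for the first five samples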

Monday, June 18, 2012

Life with Python

Well, though I was introduced to Python more than six months back, my day-to-day coding life is taking a turn in the Python direction, and hence this post.

I am really tired of turning to Google every time I need to do something in Python. Phew!! How I wish I had been introduced to Python in a formal way! This is definitely not a good way to learn a language. Okay, acknowledgements first: I found the fix to my problem here: Wrongsideofmemphis

And my issue was:
I was parsing the DBLP XML file and could do it successfully with the Expat library, but the problem was that the parser always wrote its output to stdout. I had to do further processing of the output and thus needed it redirected into a variable.
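For the record, the gist of the fix was to make the Expat handlers append to a Python variable instead of printing. A minimal sketch along those lines (the tag name and file name below are placeholders, not necessarily what I used):

import xml.parsers.expat

records = []              # parsed output collected here instead of stdout
chunks = []               # character data of the element being parsed

def start_element(name, attrs):
    del chunks[:]         # reset the buffer when a new element starts

def char_data(data):
    chunks.append(data)

def end_element(name):
    if name == 'title':   # placeholder tag name
        records.append(''.join(chunks).strip())

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

with open('dblp.xml', 'rb') as f:     # placeholder file name
    parser.ParseFile(f)

print(len(records), 'records collected in the variable "records"')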

Saturday, February 18, 2012

Bayesian Multi-Task Reinforcement Learning

Authors: Alessandro Lazaric, Mohammad Ghavamzadeh 

Conference: ICML, 2010


Summary:
The paper talks about multi-task Reinforcement Learning (RL) in a setting where the number of samples for a given task is limited because of the policy involved. This work assumes that the tasks share a similar structure and hence that the corresponding value functions (vfs) are sampled from a common prior. Because of this assumption, the authors are able to learn the vfs jointly, whether or not the vfs come from the same task class. The paper stands out from others in its use of a Hierarchical Bayesian approach to model the distribution over vfs in both parametric and non-parametric settings.
Strengths:
  1. Generative models and inference algorithms are given for both cases of learning (symmetric and asymmetric) considered.
  2. Modeling of value-function similarity by an HBM.
  3. Different modes of learning: symmetric parametric and asymmetric non-parametric learning.
  4. Almost all the key machine learning areas, like regression, sampling, Bayesian modeling, Expectation Maximization, Dirichlet Processes, etc., are touched upon here, making it a paper with sound theoretical arguments.
  5. Transfer of information from the joint distribution of vfs to learn the value function for a new task.
Weaknesses:
  1. The authors have compared the three paradigms STL, MCMTL and SCMTL, but failed to show, on the benchmark problems, how other related techniques perform (given that they have quoted a considerable number of related works). Going further, since the authors have significantly adapted ideas from the literature, they could have compared BMTL with already published results.
  2. The sampling techniques are computationally expensive, and they are employed for the asymmetric setting. A discussion of time complexity would have helped.
  3. The paper appears to be an amalgamation of already established techniques combined in a new way, and hence it frequently refers the reader to earlier papers for all the important parameters and results, which makes it hard to read. In that sense, the paper is not self-contained.
  4. There is no clear experimental setup to corroborate the claimed ability to handle an undefined number of classes.
  5. It is surprising to see that the performance dips in all cases when the number of samples increases. While it is good to see that the methods do well with limited samples and an increasing number of tasks, the proposed method should be improved to take advantage of a large number of samples, if available.
Next steps/Discussion:
  1. Referring to figure 5c, it would be good to have a discussion of why MCMTL fails when the number of tasks is limited.
  2. It is clear that some kind of transfer learning happens while learning the value function of a newly observed task. It would be interesting to analyze which transfer-learning paradigm this work falls under.
  3. It would be useful to know the types of features usually considered for representing vfs in RL, especially for benchmark problems like the inverted pendulum.
  4. Since RL is predominantly used in robotics, it would be good to know a real-world example where the vfs come from the same prior.
  5. How are simple Gaussian processes different from GPTD?

Minimum spanning tree partitioning algorithm for microaggregation

Authors: Michael Laszlo, Sumitra Mukherjee

Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 17 Issue:7
Issue date: July 2005



Summary:
The paper discusses a heuristic way of clustering a set of points by partitioning the Minimum Spanning Tree (MST) of the underlying complete graph, with a constraint on the minimum number of points per cluster, such that there is minimal loss of information. The algorithm indirectly minimizes the within-group sum of squared errors, a common objective of clustering algorithms, through three steps: MST construction, edge cutting and cluster formation.
Strengths:
  1. The paper provides a time-complexity analysis for each algorithm, which helps in understanding the scale of the problem.
  2. The application to microaggregation introduces another problem to the ML community – instead of specifying the number of clusters, the constraint is on the number of points per cluster, which indirectly controls the number of clusters.
  3. The heuristics are such that integration with any other method is straightforward, and all the subroutines in the paper rely on established algorithms like MST construction and data structures like priority queues.
Weaknesses:
  1. While the algorithm presents a new flavor of approaching the problem from a tree/graph-partitioning perspective, it compromises severely on time complexity by introducing the MST-construction step. It is clear that the last two steps are present, in some form or another, in most clustering algorithms. The authors argue that their approach performs well when the points have well-defined inherent clusters, but fail to see that an iterative, simple-to-implement k-means with appropriate modifications can do the same.
  2. Some of the running times reported are quadratic in the number of data points, which clearly makes this algorithm unsuitable for large real-world datasets.
  3. Performance (information loss) is good only for datasets with well-separated clusters, but in practice such well-defined clusters rarely occur, and, as the authors point out, simple algorithms like D or C achieve good (low) information loss for arbitrarily shaped clouds of points.
  4. The SSE criterion is best suited for Gaussian-like clouds of points; for other types of clusters, the authors fail to discuss alternative objective criteria. It is not clear what types of clusters are found in datasets like Creta discussed in the paper.
  5. While Euclidean distance is the most common measure, there are other distance measures that might be better suited to the microaggregation problem.
  6. As the authors themselves point out, there is no upper bound on the group size, and they are forced to combine fixed-size methods with MST partitioning to control it.
Next steps:
  1. When there are clearly separated clusters in the dataset, a simple clustering algorithm can be used to capture the clusters, and then, as a follow-up step to control k, the techniques discussed in the paper can be applied. This would keep the time complexity in check.
  2. It would be an interesting experiment to reverse the steps of the M-d and M-c methods: apply the fixed-size methods first and then run MST partitioning on the resulting clusters. The running time of their technique shoots up because the MST is built over the whole dataset, as the authors themselves show. Applying MST construction and edge cutting on the reduced dataset (the clusters) might improve efficiency.
  3. Instead of heuristics, it needs to be seen whether directly solving the optimization problem of minimizing SSE under the given set of constraints yields better results.
Questions for discussion:

  1. It would be interesting to see if existing clustering algorithms like k-means, agglomerative clustering, etc. could be adapted to include the fixed-number-of-points-per-cluster constraint. For example, in many implementations, k-means is restarted with different initial seeds until a specified minimum number of points falls within each group; a minimal sketch of this idea follows this list.
  2. When there are inherent clusters in the dataset, it is hard to fix k, the minimum number of points per cluster, and in that case the information loss cannot be controlled either. The problem of fixing this value worsens when the clusters have soft boundaries – this is clear from the last few columns of Table 7.
  3. Table 6 clearly shows that this technique gives a large proportion of over-sized clusters for the first class of datasets, sim_1 to sim_8, and that the average size of those clusters is large. This raises the question of how this technique can still achieve lower information loss (Table 5, last column).
  4. In section 3.3, instead of iteratively applying the procedure classifypoint to every data point, another way to utilize the existing information would be to use the descendant-count value to identify the roots of the trees in the forest and assign points to groups during the traversal of each rooted tree. Whether this alternative is more efficient remains to be seen.
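To make discussion point 1 above concrete, here is a rough sketch (my own illustration, not from the paper) of the restart idea using scikit-learn's k-means: keep re-running with different seeds until every cluster has at least min_size points. The function and parameter names are made up for this example.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_with_min_size(X, n_clusters, min_size, max_restarts=50):
    # Restart k-means with different seeds until every cluster has at
    # least min_size points; return None if no restart satisfies it.
    for seed in range(max_restarts):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
        sizes = np.bincount(km.labels_, minlength=n_clusters)
        if sizes.min() >= min_size:
            return km
    return None

# toy usage on random 2-D points
X = np.random.RandomState(0).rand(200, 2)
km = kmeans_with_min_size(X, n_clusters=4, min_size=30)
if km is not None:
    print('cluster sizes:', np.bincount(km.labels_))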

Semisupervised Learning Using Negative Labels

Authors: Chenping Hou et al.

Journal: IEEE Transactions on Neural Networks, Volume: 22 Issue: 3


Summary:
The paper talks about a special type of graph-based semisupervised learning (SSL) for classification, wherein data samples with no labels and with negative labels are used during the learning phase in addition to samples with known class labels. In particular, it uses samples with negative labels – which indicate that a data sample does not belong to a given class – to improve classification accuracy. The paper provides an iterative label propagation algorithm, NLP, which, after convergence, can be used to find the class labels of the data points with no label or a negative label.

Strengths:
The method is novel in that it uses samples with no labels too during the learning phase, whereas a traditional classifier uses only the labeled samples and has a separate testing phase to find the labels of test samples. The proposed method allows for intelligent integration of multiple negative labels for a given sample which, combined with a matrix formulation, leads to a convenient model, a closed-form solution, and a convergence analysis. The formulation also allows for out-of-sample extension, one of the key issues in the ML community. Tuning the parameter σ indirectly through τ keeps the classification accuracy stable over varying values of τ (as shown in the experimental results), because σ depends on k and on the geometry of the data points. The significantly higher classification accuracy of NLP in all the experiments shows that learning over all points, irrespective of the availability of their class information, is important – this is also discussed by the authors while presenting the classification results.

Weaknesses:
  1. While the parameters A and Y are shown to be helpful in label propagation, a careful analysis of the algorithm shows that there is effectively no label propagation at all; NLP, at its crux, sees no difference between samples with negative labels and samples with no label, because of the way it operates on Y and A.
    1. For example, consider the second point discussed under the label-update rule. When the data sample does not belong to the j-th class, the first term in equation 9, which drives label propagation, becomes negligible, and the second term goes to zero, which effectively means no label propagation. Towards the end of the same point, the authors handle label propagation when there is no prior information about the data sample by treating it as an unlabeled sample.
    2. In other words, label propagation happens only when the data sample is unlabeled or when there is not enough prior information from the NL sample. The label-propagation term updates/approximates the labels of unlabeled samples through P and F, which are nothing but combinations of the weight matrix and the labels, with their values dominated by labeled samples from the neighborhood scaled by A; this makes NLP an improved version of K-nearest neighbors. This equivalence with K-NN is also mentioned by the authors, albeit indirectly, when they discuss one-shot NLP.
  2. The effect of noise in the dataset is not discussed, and it is to be expected that noisy samples might complicate the situation because of the existence of different types of points.
  3. While getting rid of σ was an innovative approach, the authors could not make the algorithm completely parameter-free, as σ in turn depends on k, which is a parameter of the algorithm. This shows that parameter tuning is still an active area of research in ML.
  4. It is clear from the section that discusses accuracies under different NL selections that a mere increase in NL points does not guarantee an increase in classification accuracy – in fact, from Tables XII and XIII, it is clear that it dips. The paper lacks an in-depth discussion of the balance between the numbers of TLs and NLs, which is crucial for understanding the algorithm across different datasets. From figure 2, it is clear that beyond a certain number of NL points, the classification accuracy saturates.
Next steps:
  1. The most important observation about this algorithm is its good classification accuracy compared to all the others. As mentioned previously, each point brings in some information, and iterating over that information leads to better results. Such an approach needs to be verified for its applicability to other supervised learning methods, where only the points with label information are used for training, effectively using only part of the sample space. As the authors also mention, more points lead to better accuracy.
  2. It is clear from Tables XVI and XVII that NLP is the second highest in terms of computational time. As the labels of TL points are maintained consistently throughout the algorithm, and there are applications where TL points form a significant portion of the dataset, the iterations over those points could be avoided, since their labels are neither updated nor queried. This way the running time could be improved significantly.
  3. The relationship between the numbers of TLs and NLs can be analyzed further; beyond improving NLP, such analysis could bring forth other interesting observations.

Questions for discussion:
  1. It would be interesting to check whether conventional one-vs-the-rest multi-category classification tasks can be improved by incorporating additional samples with negative labels (which do not normally get used in the training phase) as belonging to 'the rest' category; a rough sketch of this idea appears after this list.
  2. The determination of σ through k and d-cap can be analyzed further for use in many kernel-based ML methods, including, but not limited to, SVMs, kernel PCA, etc., as these methods otherwise require a brute-force cross-validation routine to determine parameter values.
  3. The authors, in multiple places, speak about the equivalence between TL points and NL points with many NLs. It would be interesting to see whether existing multi-category classification datasets can be further annotated with the classes a sample does not belong to, rather than giving exactly one class label per sample, and to verify whether classification accuracies could be improved by appropriate modification of the learner under consideration.
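As a rough illustration of discussion point 1 above (my own sketch, not anything from the paper), one could fold NL information into a standard one-vs-rest scheme by adding, for each class, the samples negatively labeled for that class to its 'rest' side. The function and variable names here are made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_with_negative_labels(X, y, neg_labels, classes):
    # y[i] is the known class of sample i, or None if unknown.
    # neg_labels[i] is the set of classes sample i is known NOT to belong to.
    models = {}
    for c in classes:
        pos = [i for i in range(len(X)) if y[i] == c]
        neg = [i for i in range(len(X))
               if (y[i] is not None and y[i] != c) or c in neg_labels[i]]
        idx = pos + neg
        targets = np.array([1] * len(pos) + [0] * len(neg))
        models[c] = LogisticRegression().fit(X[idx], targets)
    return models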


Tuesday, January 24, 2012

Symmetric Matrix

I had a 4x4 symmetric matrix to be entered in Octave and I was too lazy to feed the redundant elements :), convincing myself that it was a waste of time. Well, I spent more than that 'to be wasted' time browsing for how to enter symmetric matrices economically in Octave. Believe me, it was worth it. You never get bored fiddling with matrices and, of course, Octave/Matlab.

What I found was not a single-step solution; for that matter, I could not find a single-step way to do it at all. But I was impressed by this solution for its elegance. Not just that, it caters to my requirement of feeding the unique elements (either the upper or the lower triangle) exactly once. Okay, now on to the steps:

1) Say your matrix is as below:
    16    4    8    4
    4   10    8    4
    8    8   12   10
    4    4   10   12

Obviously symmetric - let's call this A


2) V=[16 4 10 8 8 12 4 4 10 12];

V holds all the upper (lower) triangular elements one column (row) at a time.

3) Create an upper triangular matrix of ones of the same size as your matrix A, as follows:

M = triu(ones(4))
M =

   1   1   1   1
   0   1   1   1
   0   0   1   1
   0   0   0   1

4) Replace the ones with elements from V as follows:

M(M==1)=V
M =

   16    4    8    4
    0   10    8    4
    0    0   12   10
    0    0    0   12


The M==1 test picks out all the positions that contain 1, and the assignment replaces them, column by column, with the elements of V.

5) Add to M the strictly lower triangular portion of its transpose (note that M is upper triangular):

M = M + tril(M',-1)
M =

   16    4    8    4
    4   10    8    4
    8    8   12   10
    4    4   10   12

As you can see, M is A.

The -1 in the tril command makes sure that the main diagonal is left out when retrieving the lower triangular portion of the transposed M.


So, as it looks, it is a 4-step algorithm, but if you put these steps in a script, it is only one step.

Now to credits: I got this tip from one of the threads of mathworks.com :), where else?!
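For comparison, here is what I think the same trick looks like in Python/NumPy (my own sketch, not from the MathWorks thread). NumPy's tril_indices enumerates the lower triangle row by row, which lines up with the column-by-column upper-triangular order used above, so the same vector V works:

import numpy as np

V = np.array([16, 4, 10, 8, 8, 12, 4, 4, 10, 12], dtype=float)
n = 4
L = np.zeros((n, n))
L[np.tril_indices(n)] = V           # fill the lower triangle row by row
A = L + L.T - np.diag(np.diag(L))   # add the transpose, undo the doubled diagonal
print(A)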

Monday, January 23, 2012

Infinite Loop trace using GDB

Well, here is another use of GDB.
To locate the set of statements that get executed endlessly, the first few steps are the usual ones:

1) compile with -g option
2) pass the executable name to the gdb command
This takes you into the gdb environment.
3) Type 'run your-program-arguments'
Allow sufficient time for your program to get caught in the loop.
4) Press Ctrl+C
This sends SIGINT to your executable. You may see statements that do not make much sense.
5) Type 'backtrace'
6) Locate the frame number corresponding to your program or function name.
If the function is called, say, two levels deep from main, you will see all those called functions, but we are interested only in the function with the least frame number. In other words, your main will have the highest frame number.
7) Type 'frame #(that number)'
This might still not show source lines from your program. Patience is required here.
8) Keep typing 'next' command or 'n' till you see your source lines.
9) Once you see lines of your program, phew, there you go. Keep typing 'n' and you will see a bunch of lines getting executed again and again. You may want to check your local variables and other values against what you expect.


Thanks to unknownroad.com, I was able to fix this issue with less sweat.

Thursday, January 12, 2012

Debug lessons

I was debugging with gdb, and while the debug session was still going, I noted the changes to be made to the program as comments in the source file itself and saved it - and that was a MISTAKE. After such modifications, gdb still executes the code that produced the executable, but it shows you the new source lines, which is obviously not what we want. Phew!!

Tuesday, January 10, 2012

Load multiple files in matlab/octave

1) When you have multiple files with the same extension in the same directory and you want to perform the same operation on all of them from a script, the following loads all of them at once into the Matlab/Octave environment:

files = dir('*.txt');
for i = 1:length(files)
  eval(['load ' files(i).name ' -ascii']);
end



I am getting a warning message - ignoring extra args - still trying to figure out what it means.
By the way, I got this tip from Mathworks itself, exactly here.

2) As a small example, say the operation is finding the size of the data loaded from each file. The snippet below shows how to use the eval command on a set of variables with a common prefix but different numeric suffixes, loaded as above.

for ii = 1994:1999
  s = ['size(tst' int2str(ii) ')'];
  eval(s)
end



3) While searching for related Matlab commands, I came across this interesting and useful list of not-to-dos in Matlab: here

Friday, January 6, 2012

Linux Util 2

1) You can place a tab between the quotes if you first press "<Ctrl> v" and then the "<Tab>" key:


cut -f1 -d'<ctrl>v <tab>' filename

This way even the <tab> character can be made a delimiter in shell scripting.

2) If you have a file whose first column is the year, with author names against each year, to find the count of papers per year (note that uniq counts adjacent duplicates, so the file should already be sorted by year; otherwise insert a sort before uniq):

cut -f1 -d' ' fname |uniq -c


3) To find all empty lines in the standard input:
grep ^$  or  grep -v .   


4) Hmm, long live blogging: I was looking for grep command to search for a text recursively from a given directory and this blogger helped:

http://www.geekality.net/2011/04/12/unix-recursive-search-for-text-in-files/

grep -rl "your_text"

5) To redirect the output of the time command (real, sys, user) to redirect_file instead of the console:

(time your_command your_args) 2>> redirect_file

6) To search for a file's contents in another file:

grep -f searchfile tobesearchedfile -n --color=auto

You can remove the color flag. I got the color tip from Nixcraft.

7) To list only the duplicate lines of a file (again, uniq compares adjacent lines, so sort the file first if needed):

uniq -d filename

8) To unzip a file that is only compressed with bz2 use

bunzip2 filename.bz2

9) To unzip things that are compressed with .tar.bz2, use

tar -xvjpf filename.tar.bz2

I got the tips 8 and 9 from linuxquestions.org

10) Many a time, to know whether the server I am working on is serving a lot of users, I type w at the shell prompt. This is useful if you are about to run a resource-hungry process. Now, if you want to write/talk to them, you may want to know their full names. The following command gets a user's full name from the administrative database:

getent passwd "jey"|cut -d':' -f5

If you are curious, an encouraging fact is that the man page of getent is small :) !!
Happy linuxing!!

11) Wow, after about 5 months (it's Jan 8, 2013 today), I am glad to be updating my blog. Yup, it also means I have failed to keep track of my new findings on the web in this log. Anyway, the following tip was very important to me, because I realized I was spending too much time 'Alt+Tab'-ing to find the window I want to work in while working from home, logged in to my office machine (essentially all the windows are xterms).

To change the title of the xterm window to indicate something useful (in this example, your_string):
unset PROMPT_COMMAND
echo -ne "\033]0;your_string\007"

I had to call unset first because, by default, the PROMPT_COMMAND in my .bashrc keeps setting the title to the username, machine and current working directory. It is important to observe the combination of characters around "your_string" and follow it strictly.

Awk and Sed


1) To search and replace, you need not open the file in vi at all:
sed does the magic for you from the command prompt itself. Tip here.

2) To alter the order of the columns in a file and write the output to another file:
awk '{print $3" "$1" "$2}' filename > newfilename

The above example takes a 3-column file and reorders it so that the 1st column becomes the 2nd, the 2nd becomes the 3rd, and the 3rd becomes the 1st, writing the output to newfilename. Each pair of double quotes encloses a single space, which separates the columns in the output. In awk terminology, $n stands for the n-th column of the file. Giving the same source filename as the destination will not work: the shell truncates the output file before awk reads it, so use a temporary file instead.


3) To check if the first column of a line in a file is some character and if yes, print the entire line:
awk '{if ($1~/e/) print $0}' filename

Here the first column has only one character throughout. $1 indicates the first column. In this specific example, if the first column contains the single character 'e', the command prints the entire line, indicated by $0. The single character can be replaced by a word too, and there is no need for quotes around the word or character being searched for. There is also no need for spaces around the '~' character.

The default delimiter is the space character. If you want to use a different delimiter, add -F\<delimiter>. If the field you are comparing is a number, you can use numeric relational operators like ==, !=, >, etc. in the if condition.

4) Similar to case3 above, but to give more than one condition in the if  statement:
awk '{if (($1 ~ /e/) || ($1~/b/)) print $0}' filename

There are two notable differences between case 3 and case 4: each condition is now enclosed in parentheses, and the two conditions are combined with the logical OR sign '||'.

5) To search and replace character(s) from the command prompt on a bunch of files:

for ((i=1; i<=30; i++)); do
  sed -e 's/t:/ /g' filename$i > newfilename$i
done

For 30 files that all have a common prefix, this searches for the characters 't:', replaces them with a single space, and writes the output to a new file with the same numeric suffix.

6) Aug 13, 2013: To add up a column of numbers generated as the result of, say, some intermediate Linux util commands:

grep 'nonworkingset' tobedel|cut -f2 -d:|awk '{sum+=$1} END {print sum}'


The first two commands take care of producing just the list of numbers. Note that if you capitalize sum, you should do so uniformly across the entire command, since awk variable names are case-sensitive.

Thursday, January 5, 2012

octave

I finished coding an important segment of the C++ program that is part of my research. The following are the tips I learnt along the way:

1) Octave has strcat and strsplit functions, in addition to a plethora of other string-handling functions. More here and here too

2) To save variables to a file and eventually start appending to it:
save('fname','varname','-append');
The single quotes are important. fname is the file name, varname is the variable name, and the last switch is obvious. Octave does a nice job of saving variables along with information like the variable's name, its size, etc.

Blog visitor information

I experimented with

1) statcounter
2) sitemeter

for collecting information about visitors to my blog. Both offer free as well as paid versions of their widgets. To make better use of those widgets, a good knowledge of JavaScript/HTML scripting would help a lot. I am planning to experiment with the widget from Feedjit in the future. Will update this post once I know more.

Novice bloggers like me need to read the Google Webmaster Tools page if they are interested in making their blogs visible in searches.

Monday, January 2, 2012

two-dimensional array using C++ STL Vector

To create a two-D array of doubles using C++ STL templates:

I used a vector of vectors of doubles as follows:

vector< vector <double> > arrayname;

I got this tip from yolinux. As can be seen, it is a vector of vectors. Element access via arrayname[i][j] works, but bear in mind that each inner vector has to be sized (e.g. with resize) before you can index into it, and the rows are not guaranteed to be contiguous in memory, which can matter for matrix work.

If you want to do some matrix manipulation, I would suggest declaring a two-dimensional array using a pointer to an array of pointers, as follows (here, val is the array name):

double **val;                      /* listsize rows, numcols columns */
val = (double **) malloc(sizeof(double *) * listsize);
val[0] = (double *) malloc(sizeof(double) * listsize * numcols);   /* one contiguous block */
for (i = 1; i < listsize; i++)
    val[i] = val[i-1] + numcols;   /* point each row into the block */

Basically, we allocate all the elements in one contiguous block pointed to by the first pointer in the array of pointers, and then make element access easy by setting the other row pointers relative to the first one. If you don't do this, you have to map two-dimensional index values onto one dimension yourself. Don't forget to free the pointers when you are done with them (in this order):

free(val[0]);
free(val);

Thanks to Numerical Recipes in C, I learnt the correct way of declaring and allocating two-dimensional arrays in C/C++.