Musings and trials of a PhD tyro

Wednesday, February 12, 2014

Latex tips 1

So, I am starting a new post for LaTeX-related findings. It is my second post in 2014, thanks to getting prepped for KDD2014. I am surprised I have not updated this blog on my LaTeX journey until now.

Here is an awesome tool to find a symbol in LaTeX:

Detexify

Linux Util 6

I inadvertently deleted the previous contents of this post. Blogger sucks :( and would not let me revert the contents. I lost one of my precious searches on the nmap command.

1) To add a column of numbers and print the sum:

awk '{sum += $2} END {print sum}' temp
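
A quick way to convince yourself that the one-liner works is to feed it a tiny input. The sketch below is only an illustration: the helper name sum_column and the sample data are made up here, and the column number is passed as an argument instead of being fixed at 2.

```shell
# sum_column N: add up the N-th whitespace-separated field of stdin.
# (sum_column is a made-up helper name, not a standard tool.)
sum_column() {
    # "sum + 0" makes awk print 0 instead of an empty line on empty input.
    awk -v col="$1" '{ sum += $col } END { print sum + 0 }'
}

# Hypothetical sample data: a label in column 1, a number in column 2.
printf 'a 10\nb 20\nc 12\n' | sum_column 2    # prints 42
```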

2) To list the contents of a local directory when inside a ftp connection:

!ls

In words: an exclamation mark followed by the actual command.

3) If you are transferring a lot of files through mput or mget and do not have the patience to hit the 'y' key for each one, then open the ftp connection with:

ftp -i hostname

Interestingly, I would have thought -i would enable interactive mode; here it does just the reverse and turns the per-file prompting off.

4) If you have to convert from Unix time stamp to human-readable date form and vice versa, the following helps:

date -d @915149280 (to get readable date format from epoch seconds)

date +%s -d"Thu Nov 1 12:50:00 2013" (to get the unix timestamp in seconds from normal date format)
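
The two directions can be wrapped into a pair of small helpers. This is a minimal sketch assuming GNU date (the -d option); the names epoch_to_date and date_to_epoch are made up for illustration, and -u is added so the results are in UTC and do not depend on the local timezone of the machine.

```shell
# Both helpers assume GNU date; -u pins the timezone to UTC.
# epoch_to_date and date_to_epoch are made-up names for this sketch.
epoch_to_date() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

date_to_epoch() {
    date -u -d "$1" '+%s'
}

epoch_to_date 0                        # 1970-01-01 00:00:00
date_to_epoch '1970-01-02 00:00:00'    # 86400, i.e. one day in seconds

# Round trip: converting to epoch seconds and back returns the same string.
epoch_to_date "$(date_to_epoch '2012-11-01 12:50:57')"
```

Without -u the same calls work in local time, which is what the commands in this tip do.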

I literally stumbled upon the second command while reading a related question and its answer on SO. This answer was both simple and complicated in its own way. I thought I was better off with vim search and replace. Well, I thought wrong. So, here is the next tip. Stack Overflow rocks!!

5) If you have a text file with full human-readable date strings sitting between the columns, you can use vim to replace them all with epoch time using the following substitute command.

An example line looks like this:
Jey tty8 Thu Nov 1 12:50:57 2012 - Thu Nov 1 12:51:21 2012 (00:00)

Search command (typed in vim's normal mode, i.e. after pressing Esc):
:%s/\v\w+\s\w+\s\d+\s\d+:\d+:\d+\s\d+/\=system('date +%s -d"'.submatch(0).'" | tr -d "\n"')/g

After replacing:
Jey tty8 1351788657 - 1351788681 (00:00)

As you can see, the command I explained in point 4 above sits right after the word 'system'. I learnt a lot of new things from this regular expression:
a) submatch(0) returns the whole text matched by the search pattern (the first half of the expression), so it can be handed over to date.
b) tr is used for translation. In this case, it deletes the trailing "\n" that comes with the output of the system command.
c) \s matches a single whitespace character. What surprised me was the "+" after the 'w' and 'd' characters: it means "one or more", so \w+ matches a whole word and \d+ a whole number. Without it, each \w or \d matches only one character and the pattern does not match the date string.

Friday, September 13, 2013

Grace Hopper Conference

This is going to be a Life. Changing. Experience. Can't wait any more!!!

Wednesday, July 31, 2013

Linux util 5

Whew!! My 5th linux util post...

1) Assuming you know the SLURM basics (well, I am just a beginner, so I am posting this very simple solution to the problem I faced): to force salloc to release the job allocation, find the process of that command using

a) ps ax|grep $your_name|grep salloc

If you have reserved resources with a separate salloc command (without invoking the job using srun), then
b) kill -9 $pid

Or, if you have both salloc and srun in a single command,
c) kill -s HUP $pid

To see the names and numbers of all the signals, use

man 7 signal

Plain 'man signal' will take you to the C programming API for signal handling instead.

2) Here is a simple way to find the number of words in each line of a file using awk:

awk '$0="line"NR": "NF' filename

Guess I got this tip again from SO, but this solution was not the accepted one.
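
The same idea can be written with an explicit print, which may be easier to read than the assignment trick. This is just a sketch; words_per_line is a made-up helper name and the input is sample text.

```shell
# NR is the current line number and NF the number of fields (words) on it.
# words_per_line is a made-up helper name for this sketch.
words_per_line() {
    awk '{ print "line" NR ": " NF }'
}

printf 'one two three\n\nfour five\n' | words_per_line
# line1: 3
# line2: 0
# line3: 2
```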

3) If you want to pick only a few lines from the output of the make command, to selectively fix the errors or warnings (assuming you are in the bash shell):

make -f Makefile clean; make -f Makefile 2>&1 | grep 'error'
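
The redirection 2>&1 matters because compilers usually write errors to stderr, while a pipe carries only stdout. Here is a small sketch of the difference; noisy_build is a stand-in invented for this example and does not run make at all.

```shell
# noisy_build stands in for make: one line on stdout, one on stderr.
# (It is a placeholder invented for this sketch.)
noisy_build() {
    echo 'compiling foo.c'
    echo 'foo.c:3: error: missing semicolon' >&2
}

# Without 2>&1 the error line never enters the pipe, so grep counts 0 matches.
# (stderr is discarded here only to keep the demo quiet; grep exits non-zero
# when nothing matches, hence the "|| true".)
noisy_build 2>/dev/null | grep -c 'error' || true

# With 2>&1 stderr is merged into stdout before the pipe, so grep counts 1.
noisy_build 2>&1 | grep -c 'error'
```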

4) To identify the shell you are working on currently:
ps -p $$
From what I understand from the man pages, the -p option selects processes by a list of pids, and $$ expands to the pid of the current shell itself. This prints the pid of the shell, the terminal id etc. If you would rather have a concise output, then type:
echo $0
This was important to me because I change between bash and tcsh frequently when working on the HPC machines of PNNL. The default login shell was tcsh, but I had to do compilation and execution in bash. Most importantly, echo $SHELL did not help, since it reports the login shell and not the shell currently running.

Friday, June 14, 2013

Linux Util 4

Hmm, I had to open this new post because I cannot have more than 20 tags for a post.

1) June 14, 2013: Unresponsive system administrator, connection speed is slow and I am sitting in my PNNL intern office, not knowing how to see pdf files created using pdflatex in Natty - this is for my first ever paper from my internship.

Home directory based xpdf installation in Natty failed because of some missing software or dependencies - only the admin could fix this. After losing hope there, today I tried my luck with evince, a medium-weight program but definitely lighter than acroread, and searched Google for solutions to the "cannot parse arguments, cannot open display" problem.

And bingo, thanks to StackExchange, all I had to do was copy the evince binary from the global directory to my home bin directory and set an alias for evince in bashrc. Phew!!! Wish I had known this before. Long live user Gilles!! He answered his own question, incidentally. Apparently a proper fix can be done only by the admin, and in my case that would mean waiting indefinitely.

My months-long wait is finally over :). Isn't it good that we have something called a home directory in linux?

2) About 8 years back (around 2006, 2007, way long back, huh), firefox used to save the bookmarks automatically to a file called bookmarks.html, which I used to just copy to the desired location, all from the command line. I did not realize this has changed in recent versions. Now I need one url (Aeolus/ganglia, specifically) that is bookmarked on my office natty over on my PNNL box, but my bookmarks.html file (at ~/.mozilla/firefox/profilename/) holds far fewer bookmarks than I see in my browser. Thanks to MozillaZine, I understood that the default behavior is not to save the bookmarks to this file and that you have to edit the config file to force firefox to do that. Here is what I did:

Find the prefs.js file in the .mozilla directory (it should be in the same location as your other profile-specific files). The file header says you are not to edit the file, but I tried and it worked :)
Type the following as the last line of that file:

user_pref("browser.bookmarks.autoExportHTML", true);

Save it. Reopen firefox and close it. This will update the local files.
Now, open the bookmarks.html file and you will find all your saved bookmarks.

Some things to note:

Though you might have typed the 'user_pref' line towards the end, after the open-close of firefox, you will find that line sitting in a different place.

Please follow the instructions at the header of the file prefs.js and the mozilla's about:config page, if you are not comfortable editing the .js file directly.

3) I got a coupon from Quiznos for a free small sub, and I didn't expect that the machine I was given at PNNL would not have the option of selecting an area of the window for a screenshot. Here is what I did to print only the coupon (remember, my entire yahoo account screen was captured in the screenshot).

Open the image in gimp and note the pixel information of the top, left and bottom, right corners. (Just hover your mouse to these places and you will see the pixel information in the bottom left corner in gimp changing.) Let us say you got, x1, y1 for top left and x2, y2 for bottom right.
Calculate the width and height of the region to keep from this information: x2-x1 is the width, y2-y1 is the height.
mogrify -crop widthxheight+x1+y1 imagename.png

That's all.....

Note the letter 'x' between the width and height values, it is not '*', the asterisk. Warning: the original image will be modified. Image quality was not modified at all.

Thanks to the stackoverflow question for this.
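
The arithmetic for the crop argument is easy to get wrong by hand, so here is a small sketch that builds the geometry string from the two corners. The helper name crop_geometry and the corner values are made up for illustration; only the WIDTHxHEIGHT+X+Y format comes from the mogrify command above.

```shell
# crop_geometry X1 Y1 X2 Y2: print the WIDTHxHEIGHT+X+Y string that
# mogrify -crop expects, given the top-left and bottom-right corners.
# (crop_geometry is a made-up helper name.)
crop_geometry() {
    echo "$(( $3 - $1 ))x$(( $4 - $2 ))+$1+$2"
}

# Hypothetical corners read off in gimp: (120, 80) and (520, 380).
crop_geometry 120 80 520 380    # prints 400x300+120+80

# The result would then be used as:
#   mogrify -crop "$(crop_geometry 120 80 520 380)" imagename.png
```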

Wednesday, January 16, 2013

Linux Util 3

1) To kill all processes of a process tree, say a bunch of child processes spawned by a parent process (note that the children normally share the parent's process group id):

kill -9 -(ppid)

where ppid stands for the parent pid. The minus sign before the process id makes kill signal the whole process group with that id, which works when the parent is the group leader.

Another way to find the process group id (the same as the parent's pid in this case):

ps -eo "%p %r %c %a"

stackoverflow-page helped to solve this problem.

2) If you work by logging in to your office machine (Linux server) from a Windows client machine using NoMachine (NX) and you open a lot of xterms, it is oftentimes confusing to tell what each xterm window is for based just on its title. Just follow these steps to get a new, informative title (but this lasts only for the current login session):

unset PROMPT_COMMAND (assuming you have this env variable set in your .bashrc)
echo -ne "\033]0;title\007"
It is important to copy the characters in the echo command exactly and in the same order. They are not arbitrary: together they form an escape sequence that xterm interprets instead of printing.
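
For anyone curious about what the characters in the echo command do, the sequence can be wrapped in a small function with the pieces annotated. This is a sketch only; set_title is a made-up name, and printf is used instead of echo -ne simply because it behaves the same way across shells.

```shell
# set_title TEXT: emit the xterm escape sequence that sets the window title.
#   \033]0;  ESC ] 0 ; starts the "set icon name and window title" sequence
#   %s       the title text itself
#   \007     BEL, which ends the sequence
# set_title is a made-up helper name for this sketch.
set_title() {
    printf '\033]0;%s\007' "$1"
}

set_title 'build server'
```

Putting the function in .bashrc would make it available in every new xterm; the title still has to be set again in each session.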

3) If you want to do copy/paste operation in xterm,

   a) highlight the text using the mouse
   b) use the middle button/scroll wheel of the mouse to paste, or
   c) use shift+insert to paste, or
   d) when there is no middle button, click both mouse buttons simultaneously to emulate a middle-button click

Thanks to Ubuntu Forums for this help.

4) All along I have been using the pdftops - psselect - ps2pdf combination to select pages from a pdf document, but today (Jan 23, 2013) I learnt from LinuxJournal about pdftk, which selects pages from a pdf document directly without converting to ps first. How nice!!

pdftk A=100p-inputfile.pdf cat A22-36 output outfile_p22-p36.pdf

Input file is 100p-inputfile.pdf. Though the name sounds confusing, there is no need to prefix the name with 100p (100 pages). It is just for this particular example. This command selects pages from 22 to 36 and creates an output file with just those pages.

5) April 17, 2013: Found this tip while wanting to put all the docx files of submitted student homeworks in one directory, to upload them to google drive and open them there so as not to miss pictures or other openoffice-incompatible things.
Thanks to Nixcraft I could do this, and this is the first time I really found a use for the xargs command.

find . -name "*.docx" -print0 | xargs -0 -I file mv file onlydocs/
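
To try the command safely before running it on real homework files, it can be exercised in a throwaway directory. The sketch below uses made-up file names; the "! -path" test is an addition that is not in the command above, and it keeps find from picking up files that already sit in onlydocs/.

```shell
# Work in a throwaway directory so nothing real is moved.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample tree: two .docx files (one with spaces in its path)
# and one unrelated file.
mkdir -p 'student one' onlydocs
touch 'student one/home work.docx' hw2.docx notes.txt

# -print0 and -0 pass the names NUL-separated, so spaces in names are safe.
# -I file makes xargs run one mv per name, substituting it for the word "file".
find . -name '*.docx' ! -path './onlydocs/*' -print0 |
    xargs -0 -I file mv file onlydocs/

ls onlydocs
```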

6) To sync a local directory to a remote directory:

rsync -r -a -v -e "ssh -l yourname" local/directory/path remote.machine.address:remote/directory/path

You will be prompted to enter the password to the remote machine. If all else is correct, this should sync the files. Thanks to Nixcraft for this tip.

Thursday, July 5, 2012

More python

Well, in less than one day, I came to know about two important python based software systems:

1) scikit-learn : thanks to my surfing, profile creation and eventual posting of a question in Stack-Overflow, I got to know this nice tool when I was trying hard for better python based ML tools.
2) Cython came as a surprise when I was reading about the efficiency of graph algorithms in Networkx. So, there is a way out if my experiments are going to take lot of time, which I am expecting to happen. Hopefully I should have time to fiddle with Cython.

Monday, June 18, 2012

Life with Python

Well, though I was introduced to Python more than 6 months ago, my day-to-day coding life is now taking a turn in the Python direction, and hence this post.

I am really tired of referring to Google every time I need to do something in Python. Phew!! How I wish I had been introduced to Python in a formal way! I am sure this is definitely not a good way of learning a language. Okay, acknowledgements first. I found the fix to my problem here: Wrongsideofmemphis

And my issue was:
I was parsing the DBLP XML file and could do it successfully with the Expat library, but the problem was that the parser always wrote the output to stdout. I had to do further processing of the output and thus needed it to be redirected to a variable.

Saturday, February 18, 2012

Bayesian Multi-Task Reinforcement Learning

Authors: Alessandro Lazaric, Mohammad Ghavamzadeh

Conference: ICML, 2010

Summary:

The paper talks about multi-task Reinforcement Learning (RL) in an environment where the number of samples for a given task is limited because of the policy involved. This work assumes that the tasks share a similar structure and hence the corresponding value functions (vfs) are sampled from a common prior. Because of this assumption, the authors are able to learn the vfs jointly, whether or not the vfs come from the same task class. The paper stands out from others in its use of a Hierarchical Bayesian approach to model the distribution over vfs in parametric and non-parametric settings.

Strengths:

Generative models and inference algorithms for both cases of learning (symmetric and asymmetric) considered.
Modeling of value function similarity by HBM.
Different modes of learning: symmetric parametric and asymmetric non-parametric learning
Almost all the key machine learning areas like regression, Sampling, Bayesian modeling, Expectation Maximization, Dirichlet Process etc are touched upon here making it a paper with sound theoretical arguments.
Transfer of information from the joint distribution of vfs to learn the value function for new task.

Weaknesses:

Authors have compared three paradigms of STL, MCMTL and SCMTL but failed to compare, on the benchmark problems, how the other related techniques perform (given that they have quoted considerable number of related works) or even further, since the authors have significantly adapted ideas from literature, they could have given a comparison of BMTL with already published results.
The sampling techniques are computationally expensive and they are employed for asymmetric settings. Discussion of time complexity would have helped.
The paper appeared to be an amalgamation of already established techniques, combining them in some new combination and hence it had frequent referrals to old papers for all important parameters and results which made its reading hard. In that sense, the paper is not self contained.
No clear experimental setup to corroborate the ability to handle undefined number of classes.
It is surprising to see that the performance dips in all cases when the number of samples increases. While it is good to see that the methods do well for limited samples and an increasing number of tasks, the proposed method should be improved to take into account a large number of samples, if available.

Next steps/Discussion:

Referring to figure 5c, it would be good to have discussion about why MCMTL fails when the number of tasks is limited in number.
It is clear that there is some kind of transfer learning happening while learning the value function of a newly observed task. It would be interesting to analyze which paradigm of transfer learning this paper falls under.
It would be useful to know types of features usually considered for representing vfs in RL, esp for benchmark problems like inverted pendulum.
Since RL is predominantly used in Robotics, it would be good to know a real world example where the vfs are from same prior.
How are simple Gaussian processes different from GPTD?

Minimum spanning tree partitioning algorithm for microaggregation

Authors: Michael Laszlo, Sumitra Mukherjee

Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 17 Issue:7
Issue date: July 2005

Summary:

The paper discusses a heuristic way of clustering a set of points by partitioning the corresponding Minimum Spanning Tree (MST) of the underlying complete graph with a constraint on the minimum number of points per cluster such that there is minimal loss of information. The algorithm indirectly minimizes the sum of the within group squared error, a common objective of clustering algorithms, through three steps of MST construction, edge cutting and cluster formation.
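
For reference, the within-group squared error mentioned above is commonly written as below. The notation is the usual textbook one and is assumed here, not copied from the paper: G_1, ..., G_g are the groups, \bar{x}_j is the centroid of group G_j, and k is the minimum number of points allowed per group.

```latex
\mathrm{SSE} = \sum_{j=1}^{g} \sum_{x_i \in G_j} \lVert x_i - \bar{x}_j \rVert^2,
\qquad
\bar{x}_j = \frac{1}{|G_j|} \sum_{x_i \in G_j} x_i,
\qquad
\text{subject to } |G_j| \ge k \text{ for every } j .
```

A lower SSE means the points sit closer to their group centroids, which is what "minimal loss of information" refers to in this setting.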

Strengths:

The paper discusses time complexity analysis for each algorithm which helps to understand the scale of the problem.
Application to microaggregation introduces another problem to ML community – instead of specifying the number of clusters, the constraint is on the number of points in the cluster, which indirectly controls the number of clusters.
The heuristics are such that integration with any other method is straightforward and all the subroutines used in the paper use established algorithms like MST and data-structures like priority queue etc.

Weaknesses:

While the algorithm presents a new flavor of approaching the problem from a Tree/Graph partitioning perspective, it has compromised severely on the time complexity by introducing the MST construction step. It is clear that the last two steps are present in most of the clustering algorithms in some form or the other. The authors argue that when the points have well-defined inherent clusters, their approach performs well, but fail to see that an iterative, simple-to-implement k-means with appropriate modifications can do the same operation.
Some of the running times reported are quadratic in the number of data points, which clearly makes this algorithm unsuitable for large real world datasets.
Performance (information loss) is good only for data sets with well separated clusters, but in practice such well defined clusters rarely occur and as authors point out simple algorithms like D or C, have good minimum information loss for arbitrary shaped cloud of points.
The SSE criterion is best suited for Gaussian type clouds of points; for other types of clustered points, the authors fail to discuss alternative objective criteria. It is not clear what types of clusters are found in datasets like Creta discussed in the paper.
While euclidean distance is the most common measure, there are other types of distance measure that might be suited for microaggregation problem.
As authors themselves point out, there is no upper bound on the group size and they are forced to resort to combination of fixed sized methods with MST partitioning to control the upper bound.

Next steps:

When there are clearly separated clusters in the dataset, a simple clustering algorithm can be used to capture the clusters and then, as a follow-up step to control k, the techniques discussed in the paper can be used. This will check the increasing time complexity.
It would be an interesting experiment to reverse the steps of M-d and M-c methods and to apply the fixed-size methods first and apply MST on the resultant clusters. The time for their technique is shooting up because of running the MST on the whole dataset, which is shown by the authors too. Applying MST and Edge cutting on the reduced dataset (clusters) might help to improve efficiency.
Instead of heuristics, it remains to be seen whether directly solving the optimization problem of minimizing SSE with the given set of constraints yields better results.

Questions for discussion:

It would be interesting to see if the existing clustering algorithms like k-means, agglomerative clustering etc. could be adapted to include the fixed-number-of-points-per-cluster. For example, in many implementations, k-means is restarted with different initial seeds multiple times till a specified minimum number of points fall within each group.
When there are inherent clusters in the dataset, it is hard to fix k, the number of points in the cluster and in that case, the information loss cannot be controlled too. The problem of fixing the value increases when the clusters have soft boundaries – this is clear from the last few columns of table 7.

Table 6 clearly shows that this technique gives a large portion of over-sized clusters for the first class of datasets, from sim_1 to sim_8, and also that the average size of those clusters is large. It brings up the question of how it is possible to get lower information loss (table 5 – last column) from this technique.
In section 3.3, instead of iteratively applying the procedure classifypoint to every data point, another way to utilize the existing information would be to use the descendant count value to capture the roots of the trees in the forest and assign the points during the traversal of each rooted tree to a particular group. Whether this alternative is efficient or not remains to be seen.

Semisupervised Learning Using Negative Labels

Authors: Chenping Hou et al.

Journal: IEEE Transactions on Neural Networks, Volume: 22 Issue: 3

Summary:

The paper talks about a special type of graph-based semisupervised learning (SSL) for classification, wherein data samples with no labels and with negative labels are used during the learning phase in addition to samples with known class labels. In particular, it uses samples with negative labels, which state that a data sample does not belong to a given class, to improve classification accuracy. The paper provides an iterative label propagation algorithm, NLP, which after convergence can be used to find the class label of data points with no label or a negative label.

Strengths:

The method is novel in that it uses samples with no labels too during the learning phase, whereas a traditional classifier uses only the labeled samples and has a separate testing phase to find the labels of test samples.
The proposed method allows for intelligent integration of multiple negative labels for a given sample, which, combined with the matrix formulation, gives a convenient model that in turn made it possible to derive a closed-form solution and perform convergence analysis.
The very formulation allows for out-of-sample extension, one of the key issues in the ML community.
The tuning of parameter σ, indirectly through τ, helped to keep classification accuracy under control across the various values of τ (as shown in the experimental results) because of σ's dependence on k and the geometry of the data points.
Significantly higher classification accuracy in all the experiments by NLP shows that learning over all points irrespective of the availability of their class information is important - this is also discussed by the authors while presenting classification results.

Weaknesses:

While the parameters A and Y are shown to be helpful in label propagation, careful analysis of algorithm clearly shows that there is no label propagation at all and, NLP, at its crux, sees no difference between samples with negative labels and no label, because of the way it operates and sets Y and A.
1. For example, consider the second point discussed under the rule for label updation. When the data sample does not belong to j-th class, the first term in equation 9 that helps for label propagation becomes negligible and the second term goes to zero, which effectively means no label propagation and towards the end of the same point, the authors explain about label propagation when there is no prior information about the data sample by considering it as an unlabeled sample.
2. In other words, label propagation happens only when the data sample is unlabeled or when there is not enough prior information from the NL sample. The term for label propagation helps to update/approximate labels of unlabeled samples through P and F, which are nothing but combinations of weight matrix and labels, with their values dominated by labeled samples from the neighborhood scaled by A, and this makes NLP an improved version of K-nearest neighbor. This equivalence with K-NN is also mentioned by the authors, albeit indirectly, while they mention one-shot NLP.
The effect of noise in the dataset is not discussed and it is to be expected that noisy samples might complicate the situation because of existence of different types of points.
While it was an innovative approach to get rid of σ, the authors could not make the algorithm completely parameter-proof, as σ in turn depends on k, which is a parameter to the algorithm. This shows that parameters and their tuning are still an active field of research in ML.
It is clear from the section that discusses accuracies with different NL selections that a mere increase in NL points does not guarantee an increase in classification accuracy - in fact, from tables XII and XIII it is clear that it dips. The paper lacks an in-depth discussion of the balance between the number of TLs and NLs, which is crucial for understanding the algorithm over different datasets. From figure 2, it is clear that after a specified number of NL points, the classification accuracy saturates.

Next steps:

The most important observation pertaining to this algorithm is its good classification accuracy when compared to all others. As mentioned previously, each point brings in some information and that information when iterated over leads to better results. Such an approach needs to be verified for its applicability in other supervised learning approaches, where only the points with label information are used for training, effectively using only part of sample space. As is mentioned by authors too, more points lead to better accuracy.
It is clear from the tables XVI and XVII that NLP is the second highest in terms of computational time. As the labels of TL points are maintained consistently throughout the algorithm, and there are applications where TL points form significant portion of the dataset, the iterations over those points could be avoided, as in any case their labels are neither updated nor queried. This way the time could be significantly improved.
The relationship between number of TLs and NLs can be analyzed further, not just for improving NLP, but such analysis could bring forth other interesting observations.

Questions for discussion:

It would be interesting to check if the conventional one-vs-the rest multi-category classification tasks can be improved by incorporating additional samples with negative labels (that do not normally get incorporated in the training phase), as belonging to the 'the-rest' category.
The σ determination through k and d-cap can be further analyzed for use in many kernel based ML methods including, but not limited to, SVMs, Kernel PCA etc, as these methods require the use of brute-force cross-validation routine to help determine parameter values.
The authors, at multiple instances, speak about equivalence of TL points and NL points with more NLs. It would be interesting to see if the existing multi-category classification datasets can be further investigated to identify the classes the samples do not belong to rather than giving exactly one class label per sample, and to verify whether the classification accuracies could be improved by appropriate modification to the learner under consideration.

Wednesday, February 15, 2012

The Hindu : Columns / Harsh Mander : Barefoot - The other side of life

Tuesday, January 24, 2012

Symmetric Matrix

I had a 4x4 symmetric matrix to be entered in Octave and I was too lazy to type the redundant elements :), convincing myself that it is a waste of time. Well, I spent more than that 'to be wasted' time browsing for how to enter symmetric matrices economically in Octave. Believe me, it was worth it. You never get bored fiddling with matrices and of course, Octave/Matlab.

What I found was not a single step solution, for that matter I could not get a single step way to do it at all. I was impressed by this solution for its elegance. Not just that, this solution caters to my requirement of feeding unique elements (either upper triangular or lower) exactly once. Okay, now on to the steps:

1) Say your matrix is as below:
    16    4    8    4
    4   10    8    4
    8    8   12   10
    4    4   10   12

Obviously symmetric - let's call this A

2) V=[16 4 10 8 8 12 4 4 10 12];

V holds all the upper (lower) triangular elements one column (row) at a time.

3) Create an upper triangular matrix of ones of the same size as your A matrix as follows:

M = triu(ones(4))
M =

   1   1   1   1
   0   1   1   1
   0   0   1   1
   0   0   0   1

4) Replace the ones with elements from V as follows:

M(M==1)=V
M =

   16    4    8    4
    0   10    8    4
    0    0   12   10
    0    0    0   12

The M==1 test returns a logical mask marking every position that holds a 1, and the assignment fills those positions, one column at a time, with the elements of V.

5) Add M with itself but after transposing and taking only the lower triangular portion (note M is an upper triangular one):

M = M + tril(M',-1)
M =

   16    4    8    4
    4   10    8    4
    8    8   12   10
    4    4   10   12

As you can see, M is A.

-1 in the tril command makes sure that you are leaving the main diagonal when retrieving the lower triangular portion of the transposed M.

So, as it looks it is a 4 step algorithm, but if you can create a script with these steps in it, it is only one step.

Now to credits: I got this tip from one of the threads of mathworks.com :), where else?!

Monday, January 23, 2012

Infinite Loop trace using GDB

Well, here is another use of GDB.
To locate the set of statements that get executed infinitely, as usual for the first few steps,

1) compile with -g option
2) pass the executable name to gdb command
This will take you to the gdb environment.
3) Type 'run your-program-arguments'
Allow sufficient time to make your program get caught in that loop
4) Type ctrl+C
This will send SIGINT to your executable. You will see statements not making much sense.
5) Type 'backtrace'
6) Locate the frame number corresponding to your program or function name.
If the function is called, say, two levels deep from main, you will see all those called functions, but we are interested only in the function of yours with the lowest frame number. In other words, your main will have the highest frame number.
7) Type 'frame #(that number)'
This might not still show source lines from your program. Patience is required here.
8) Keep typing 'next' command or 'n' till you see your source lines.
9) Once you see lines of your program, phew, there you go. Keep typing 'n' and you will see a bunch of lines getting executed again and again. You may want to check your local variables and other variable values against the expected values.

Thanks to unknownroad.com , I was able to fix this issue with less sweat.

Thursday, January 12, 2012

Debug lessons

I was debugging using gdb and, while the debug session was going on, I noted the changes to be made to the program as comments in the source file itself and saved them - and that was a MISTAKE. After such modifications, gdb still runs the code the executable was built from but displays lines from the edited source file, so the lines shown no longer match what is executing, which obviously we don't want. Phew!!

Tuesday, January 10, 2012

Load multiple files in matlab/octave

1) When you have multiple files with the same extension in the same directory and want the same operation done on all of them from a script, use the following to load them all at once into the Matlab/Octave environment:

files = dir('*.txt');
for i=1:length(files)
    eval(['load ' files(i).name ' -ascii']);
end

I am getting a warning message - ignoring extra args - still trying to figure out what it means.
By the way, I got this tip from Mathworks itself, exactly here.

2) As a small example, say the operation you do is finding size of the files. The below snippet shows how to use eval command on set of files with common prefix but different numeral suffix that are loaded as above.

for ii=1994:1999
s=['size(tst' int2str(ii) ')'];
eval(s)
end

3) While searching for such related Matlab commands, I came across this interesting and useful list of not-to-dos in Matlab: here

Friday, January 6, 2012

Linux Util 2

1) You can place a tab between the quotes if you first press "<ctrl>v" and then the "<tab>" key:

cut -f1 -d'<ctrl>v <tab>' filename

This way even the <tab> character can be made the delimiter in shell scripting.
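
When the command lives in a script, where pressing keys is not an option, the tab can be produced with printf instead. This is a small sketch; first_field is a made-up helper name and the input is sample text.

```shell
# first_field: print the first tab-separated column of stdin.
# "$(printf '\t')" produces a literal tab without needing Ctrl-V;
# tab is also the delimiter cut uses when -d is not given at all.
# (first_field is a made-up helper name.)
first_field() {
    cut -f1 -d"$(printf '\t')"
}

printf 'alpha\tbeta\ngamma\tdelta\n' | first_field
# alpha
# gamma
```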

2) If you have a file where the first column is a list of years with author names against each year, to find the count of papers in each year (uniq -c counts only adjacent duplicates, hence the sort):

cut -f1 -d' ' fname | sort | uniq -c
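A quick way to try the per-year count on made-up data. The file contents below are hypothetical, the sketch sorts before counting so that repeated years end up next to each other, and the awk at the end only puts the year before the count:

```shell
# Hypothetical "year author" list, deliberately not sorted by year.
tmpfile=$(mktemp)
printf '2010 Smith\n2011 Jones\n2010 Lee\n2011 Kim\n2010 Rao\n' > "$tmpfile"

# Extract the year column, group equal years together, count each group,
# then print "year count" without uniq's leading padding.
year_counts=$(cut -f1 -d' ' "$tmpfile" | sort | uniq -c | awk '{print $2, $1}')
echo "$year_counts"

rm -f "$tmpfile"
```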

3) To find all empty lines in the standard input:
grep '^$' or grep -v .
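If only the number of empty lines matters, the -c flag counts them instead of printing them. A small sketch on made-up input, showing that both patterns agree:

```shell
# Six lines of sample input, three of them empty.
sample=$(printf 'one\n\ntwo\n\n\nthree\n.')

# Count lines that match "start immediately followed by end".
blank_count=$(printf 'one\n\ntwo\n\n\nthree\n' | grep -c '^$')

# Count lines that do NOT contain any character at all.
blank_count_v=$(printf 'one\n\ntwo\n\n\nthree\n' | grep -vc .)

echo "$blank_count $blank_count_v"
```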

4) Hmm, long live blogging: I was looking for a grep command to search for text recursively from a given directory, and this blogger helped:

http://www.geekality.net/2011/04/12/unix-recursive-search-for-text-in-files/

grep -rl "your_text"
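Here is a small sketch of the recursive search on a scratch directory. The directory layout and file names are made up, and I am assuming a grep that supports -r (GNU and BSD grep both do):

```shell
# Scratch tree with one matching file in a subdirectory.
workdir=$(mktemp -d)
mkdir -p "$workdir/sub"
printf 'needle here\n' > "$workdir/sub/a.txt"
printf 'nothing\n' > "$workdir/b.txt"

# -r descends into subdirectories, -l prints only the names of matching files.
found=$(cd "$workdir" && grep -rl "needle" .)
echo "$found"

rm -rf "$workdir"
```

Passing the starting directory explicitly (here '.') keeps the command working on greps that do not default to the current directory.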

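5) To redirect the output of the time command (real, user, sys) to redirect_file instead of the console. time reports on stderr, hence the 2>>, and the parentheses run it in a subshell so that the redirection captures the timing lines too: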

(time your_command your_args) 2>> redirect_file

6) To search for a file's contents in another file:

grep -f searchfile tobesearchedfile -n --color=auto

Each line of searchfile is used as a pattern, and -n prints the matching line numbers. The color flag is optional. Got this color tip from Nixcraft.
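A minimal sketch of the pattern-file search, with two throwaway files whose contents are invented for the example:

```shell
# searchfile holds one pattern per line; tobesearchedfile is the haystack.
workdir=$(mktemp -d)
printf 'apple\ncherry\n' > "$workdir/searchfile"
printf 'apple pie\nbanana split\ncherry tart\n' > "$workdir/tobesearchedfile"

# -f reads the patterns from a file, -n prefixes each hit with its line number.
matches=$(grep -n -f "$workdir/searchfile" "$workdir/tobesearchedfile")
echo "$matches"

rm -rf "$workdir"
```

If the search file contains characters such as '.' or '*' that should be taken literally, adding -F makes grep treat every line as a fixed string rather than a regular expression.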

7) To list only the duplicate lines of a file (uniq compares adjacent lines only, so sort the file first if the duplicates are scattered):

uniq -d filename
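A small sketch with scattered duplicates, using made-up input, where sorting first brings equal lines together:

```shell
# 'a' and 'b' each appear twice, but never on neighbouring lines.
tmpfile=$(mktemp)
printf 'b\na\nb\nc\na\nd\n' > "$tmpfile"

# Sort so equal lines become adjacent, then print one copy of each repeated line.
dups=$(sort "$tmpfile" | uniq -d)
echo "$dups"

rm -f "$tmpfile"
```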

8) To unzip a file that is compressed only with bz2, use:

bunzip2 filename.bz2

9) To unzip things that are compressed as .tar.bz2, use:

tar -xvjpf filename.tar.bz2

I got tips 8 and 9 from linuxquestions.org.

10) Many a time, to know if the server I am working on is serving a lot of users, I type w at the shell prompt. This is useful if you are going to run a resource-greedy process. Now, if you want to write/talk to those users, you may want to know their full names. The following command gets a user's full name from an administrative database:

getent passwd "jey"|cut -d':' -f5

If you are curious, an encouraging fact is that the man page of getent is short :) !!
Happy linuxing!!

11) Wow, after about 5 months (it's Jan 8, 2013 today), I am glad to be updating my blog. Yup, it also means I have failed to keep track of my new findings on the web in this database. Anyway, the following tip was very important to me because I realized I was spending too much time 'Alt+tab'ing to find the window I want while working from home, logged in to my office machine (essentially all of the windows are xterms).

To change the title of the xterm window to indicate something useful (in this example, your_string):
unset PROMPT_COMMAND
echo -ne "\033]0;your_string\007"

I had to call unset first because by default in my .bashrc file, PROMPT_COMMAND stores the username, machine and current working directory. It is important to observe the combination of characters around "your_string" and follow them strictly.
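To avoid retyping the escape characters every time, the sequence can be wrapped in a small function. This is only a sketch: set_title is a name I made up, and it uses printf instead of echo -ne since printf behaves the same across shells.

```shell
# Hypothetical helper: set the xterm window title to the first argument.
# \033]0; starts the title sequence and \007 (BEL) ends it.
set_title() {
    printf '\033]0;%s\007' "$1"
}

# Stop the prompt from overwriting the title after every command.
unset PROMPT_COMMAND

set_title "office: build window"
```

Putting the function in .bashrc makes it available in every new xterm.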

Awk and Sed

1) To search and replace, you need not open the file in vi at all:
sed does the magic for you from the command prompt itself. Tip here.

2) To alter the order of the columns in a file and write the output to another file:
awk '{print $3" "$1" "$2}' filename > newfilename

The above example takes a 3-column file, moves the 1st column to 2nd, the 2nd to 3rd and the 3rd to 1st, and writes the output to newfilename. There is no space between the double quotes and the column indices. $n stands for the n-th column of the file in awk terminology. Giving the source filename as the destination does not work: the shell empties the file for the redirection before awk gets to read it, so write to a different file and rename it afterwards.
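When the reordered columns should end up under the original file name, one option is to write to a temporary file and rename it once awk succeeds. A sketch with hypothetical file names and contents:

```shell
# Three-column sample file (made-up data).
workdir=$(mktemp -d)
printf 'a b c\nd e f\n' > "$workdir/data.txt"

# Write the reordered columns to a temporary file, and only replace the
# original if awk finished without error.
awk '{print $3" "$1" "$2}' "$workdir/data.txt" > "$workdir/data.tmp" \
    && mv "$workdir/data.tmp" "$workdir/data.txt"

reordered=$(cat "$workdir/data.txt")
echo "$reordered"

rm -rf "$workdir"
```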

3) To check if the first column of a line in a file is some character and if yes, print the entire line:
awk '{if ($1~/e/) print $0}' filename

$1 indicates the first column and $0 the entire line. In my file the first column holds only one character throughout, so the command prints every line whose first column is 'e'. Strictly, '~' is a regular-expression match, so /e/ matches any first column that contains an 'e'; use /^e$/ for an exact match. The single character can be replaced by a word too, and there is no need for quotes around the word or character to be searched. There is no need for spaces around the '~' character.

The default delimiter is whitespace (any run of spaces or tabs). To use a different delimiter, add -F<delimiter>, escaping it with a backslash if it is special to the shell. If the field you are comparing is a number, you can use numeric relational operators such as ==, !=, > etc. in the if condition.
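A short sketch that puts the delimiter switch and the relational operators together. The colon-separated sample data is made up:

```shell
# Sample "letter:number" records (hypothetical data).
tmpfile=$(mktemp)
printf 'e:5\nb:12\ne:30\n' > "$tmpfile"

# -F: splits on colons; print lines whose second field is greater than 10.
big_rows=$(awk -F: '{if ($2 > 10) print $0}' "$tmpfile")

# == on a string compares the whole field, unlike the ~ regex match.
exact_e=$(awk -F: '$1 == "e" {print $2}' "$tmpfile")

echo "$big_rows"
echo "$exact_e"

rm -f "$tmpfile"
```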

4) Similar to case 3 above, but with more than one condition in the if statement:
awk '{if (($1 ~ /e/) || ($1~/b/)) print $0}' filename

There are two important differences between case 3 and case 4: each condition now sits inside its own enclosing parentheses, and the conditions are joined by the logical OR sign '||'.

5) To search and replace character(s) from the command prompt on a bunch of files:

for((i=1;i<=30;i++)); do
sed -e 's/t:/ /g' filename$i > newfilename$i ; done

For 30 files that all have a common prefix, this searches for the string 't:', replaces every occurrence with a single space and writes the output to a new file with the same number.
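The same loop can be tried safely on a scratch directory first. This sketch uses only three files with invented contents, and a plain for-in loop, which also works in shells that lack the C-style for((...)) form:

```shell
# Create three numbered input files (hypothetical contents).
workdir=$(mktemp -d)
for i in 1 2 3; do
    printf 't:alpha t:beta\n' > "$workdir/filename$i"
done

# Replace every 't:' with a single space, one output file per input file.
for i in 1 2 3; do
    sed -e 's/t:/ /g' "$workdir/filename$i" > "$workdir/newfilename$i"
done

result=$(cat "$workdir/newfilename2")
new_count=$(ls "$workdir" | grep -c '^newfilename')
echo "$result"

rm -rf "$workdir"
```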

6) Aug 13, 2013: To add up a column of numbers generated by some intermediate Linux util commands:

grep 'nonworkingset' tobedel|cut -f2 -d:|awk '{sum+=$1} END {print sum}'

The first two commands take care of getting just the list of numbers. Note that awk variable names are case-sensitive, so if you capitalize sum you should do it uniformly across the entire command.
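A sketch of the pipeline on made-up data, including what happens when the variable name is capitalized in only one place:

```shell
# Hypothetical "label:number" lines; only two of them are 'nonworkingset'.
tmpfile=$(mktemp)
printf 'nonworkingset:4\nworkingset:100\nnonworkingset:6\n' > "$tmpfile"

# Filter the lines, keep the number after the colon, and add the numbers up.
total=$(grep 'nonworkingset' "$tmpfile" | cut -f2 -d: | awk '{sum += $1} END {print sum}')

# Sum and sum are different variables, so this prints an empty value.
mismatched=$(grep 'nonworkingset' "$tmpfile" | cut -f2 -d: | awk '{Sum += $1} END {print sum}')

echo "total=$total mismatched=[$mismatched]"

rm -f "$tmpfile"
```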

Thursday, January 5, 2012

octave

I finished coding an important segment of the C++ program, which is part of my research. The following are the tips I learnt:

1) Octave has strcat and strsplit functions in addition to a plethora of other string handling functions. More here and here too.

2) To save variables in a file and eventually start appending to it:
save('fname','varname','-append');
The single quotes are important. fname is the file name, varname is the variable name and the last switch is obvious. Octave does a nice job of saving variables with information like name of the variable, its size etc.

Blog visitor information

I experimented with

1) statcounter
2) sitemeter

for collecting information about visitors to my blog. Both offer free as well as paid versions of their widgets. To make better use of those widgets, a good knowledge of JavaScript/HTML would help a lot. I am planning to experiment with the widget from Feedjit in future. Will update this post once I get to know more.

Novice bloggers like me need to read the Google Webmaster tools page if they are interested in making their blog visible in searches.