Craic Computing Tech Tips

A collection of computer systems and programming tips that you may find useful.
 
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Friday, January 27, 2012

Getting absolute paths with the Unix ls command

The standard Unix command ls lists filenames and directories in the specified directory. The default behaviour is to list just the filenames as including the full pathname would clutter the screen.

But sometimes you want the absolute paths. I need this all the time when I want to create a file containing a list of filenames. The obvious command to get all the YAML files, for example, is:
$ ls -1 *yml
A.yml
B.yml

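In order to get the full pathnames you need to use this. The -d option stops ls from listing the contents of any directory that matches, and the quotes around $PWD keep the command working when the directory name contains spaces:
$ ls -1 -d "$PWD"/*yml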
/home/jones/A.yml
/home/jones/B.yml
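
If the goal is a file containing the list, a small shell function can do the same job without ls. This is only a sketch, assuming a POSIX shell: the function name list_abs and the output file name filelist.txt are made up for illustration.

```shell
# list_abs DIR SUFFIX (hypothetical helper, not a standard command)
# Prints the absolute path of every file in DIR whose name ends in SUFFIX.
list_abs() {
    # Resolve DIR to an absolute path; give up if it does not exist
    dir=$(cd "$1" >/dev/null && pwd) || return 1
    for f in "$dir"/*"$2"; do
        # Skip the unexpanded pattern when nothing matches
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example use (commented out so that loading the function writes nothing):
# list_abs . yml > filelist.txt
```

Using printf with one path per line keeps the output safe to feed into another program, even when the filenames contain spaces.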

Wednesday, January 25, 2012

Wolfram CDF Player and Chrome Browser

Wolfram Computable Document Format (CDF) is a way to embed interactive documents into web pages, in particular those that perform calculations in response to a user changing the parameters. For example you can create graphs of functions that will change as the function is modified. It is an extension of the Wolfram Mathematica software.

It looks really promising for some applications and you should check out their demonstrations - some impressive, some not so much.

You 'play' CDF files using a browser plugin - just like Flash - and these are available for all the current browsers.

I'm running it on Google Chrome on Mac OS X on a fairly recent laptop. It performs OK depending on the specific application and the amount of data it is asked to push around. But when you close that window or move to another page the CDF player process continues to run. In my case that was taking 5% of my cpu and 66MB of memory and it continued to do so for perhaps 10 minutes after the page was closed.

This sort of plugin drain on your cpu is fairly common - just look at everything going on in Activity Monitor when you are browsing an 'active' web page with ads, etc.

In Chrome you can go to Window -> Task Manager, select a process and End it - but that didn't appear to do anything in my case.

CDF looks very interesting but if it requires too many resources, and then fails to release them properly, then it is not likely to be broadly adopted. It is something to keep an eye on, for sure.


Wednesday, January 4, 2012

Disabling Spotlight (mds) on Mac OS X (Snow Leopard)

I run a lot of command line scripts on my laptop - some of which can run for hours. I want to continue using the machine for reading mail, etc., but I don't want any other intensive task sucking up the cpu cycles. So I shut down iTunes, don't watch any videos, etc.

But sometimes I see some other process taking all my cycles. The odds are that it is either something to do with Flash or it is a process called mds.

mds is the indexing software that powers Spotlight - the built in Mac search facility.

I suspect that when I'm generating gigabytes of data and hundreds of files in one of my compute jobs, mds is responding by trying to index them at the same time.

I don't use Spotlight at all, so let's turn it off and see if that helps.

This turns it off:
$ sudo mdutil -a -i off 

This turns it back on:
$ sudo mdutil -a -i on

Turning it back on will presumably trigger a big mds run as it plays catch up, so run this command only when you can afford the cycles.


Thursday, November 17, 2011

strsplit in R

The strsplit function in the R statistics package splits a string into substrings based on the separator, just like split in Perl or Ruby. Unlike those, the object returned is a List, one of the core R object types, with one element for each input string; each element is a character vector holding the substrings. For example:
> a <- strsplit("x y z", ' ')
> a
[[1]]
[1] "x" "y" "z"
> class(a)
[1] "list"

If you are not that familiar with R, like me, the obvious way to access an element in the list will not work, because the list has only one element and that element holds all three substrings:
> a[1]
[[1]]
[1] "x" "y" "z"
> a[2]
[[1]]
NULL

So what do you do? There seem to be two options:

1: You can 'dereference' the element (for want of a better word) by using two sets of brackets - a[[1]] extracts the character vector and the second index picks out the substring
> a[[1]][1]
[1] "x"
> a[[1]][2]
[1] "y"
... but I'm not going to write code that looks like that !!

2: You can unlist the List to create a Vector and then access elements directly
> b <- unlist(a)
> b
[1] "x" "y" "z"
> class(b)
[1] "character"
> b[1]
[1] "x"
> b[2]
[1] "y"
Much nicer !





Wednesday, November 16, 2011

Running R scripts on the Command Line


Using the R statistics package via a GUI is great for one-off tasks or for learning the language, but for repeated tasks I want the ability to create and run scripts from the UNIX command line.

There are several ways to do this:

R CMD BATCH executes R code in a script file with output being sent to a file.
$ R CMD BATCH myscript.R myoutputfile
If no output file is given then the output goes to myscript.Rout. There is no way that I know of to have it go to STDOUT. Passing parameters to your script with this approach is a pain. Here is an example script:
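args <- commandArgs(TRUE)
# seq_along() loops zero times when there are no arguments; 1:length(args) would loop over 1 and 0
for (i in seq_along(args)){
   print(args[i])
}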

This is invoked with this command:
$ R CMD BATCH  --no-save --no-restore --slave --no-timing "--args foo=1  bar='/my/path/filename'" myscript.R tmp
In particular, note the strange quoting of the arguments, preceded by --args inside the outer quotes - it's not a typo!


That command produces this output in file 'tmp':
[1] "foo=1"
[1] "bar='/my/path/filename'"
All those '--' options are necessary ! Try leaving out --slave and --no-timing and you'll see why.

Thankfully there is a better option ...

Rscript is an executable that is part of the standard installation.

You can add a 'shebang' line to the file with your R script, invoking Rscript, make the file executable and run it directly, just like a Perl, Python or Ruby script.

You don't need those extra options as they are the default for Rscript, and you pass command line options directly without any of that quoting nonsense.

Here is an example script:
#!/usr/bin/env Rscript
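args <- commandArgs(TRUE)
# seq_along() loops zero times when there are no arguments; 1:length(args) would loop over 1 and 0
for (i in seq_along(args)){
  print(args[i])
}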
Running this with arguments:
$ ./myscript.R foo bar
produces this output on STDOUT (which you can then redirect as you choose)
[1] "foo"
[1] "bar"
Much nicer - but we've still got those numeric prefixes. If you are passing the output to another program these are a major pain.

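The way to avoid those is to use cat() instead of print() - BUT you need to explicitly include the newline character as a separate argument to the cat() function. (cat() puts a space between its arguments by default, so each output line below ends with a space; adding sep="" to the call avoids that.)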
#!/usr/bin/env Rscript
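args <- commandArgs(TRUE)
# seq_along() loops zero times when there are no arguments; 1:length(args) would loop over 1 and 0
for (i in seq_along(args)){
  cat(args[i], "\n")
}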
results in this output:

$ ./myscript.R foo bar
foo 
bar 


For the sake of completeness, you can run R scripts with a shebang line that invokes R directly. But Rscript seems to be the best solution.


If you want to pass command line arguments as attribute=value pairs then you need to parse them out within your script. I haven't got this working in a general sense yet. What I want is to pass arguments like this:
$ ./myscript.R infile="foo" outfile='bar'
But I'm not quite there yet...
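
One thing worth checking first is what the script actually receives, because the shell strips the quotes before R ever sees the arguments. The sketch below uses a shell function as a stand-in for myscript.R (the name show_pairs is made up for illustration) and splits each argument at the first '='. It is not an R solution; the same split would still have to be written inside the R script.

```shell
# show_pairs: hypothetical stand-in for myscript.R, written in shell.
# Splits each key=value argument at the first '=' and prints both parts.
show_pairs() {
    for arg in "$@"; do
        case $arg in
            *=*)
                key=${arg%%=*}     # text before the first '='
                value=${arg#*=}    # text after the first '='
                printf '%s -> %s\n' "$key" "$value"
                ;;
            *)
                # Anything without an '=' is reported on STDERR
                printf 'not a key=value pair: %s\n' "$arg" >&2
                ;;
        esac
    done
}

# The shell removes both kinds of quotes, so this prints:
#   infile -> foo
#   outfile -> bar
show_pairs infile="foo" outfile='bar'
```

Splitting at the first '=' rather than the last means a value such as a path can itself contain an '=' without confusing the parser.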

Deleting a File that starts with '-' on UNIX

Filenames that begin with 'special' characters like '-', '.' or '*' cause problems on Unix. A leading '-' is particularly awkward because standard commands like ls or rm treat the name as a set of command options.

You don't typically create files with names like this but they can arise through errors such as cutting and pasting text into your command line.

Simply escaping the character or quoting the filename does not work, because the shell removes the escape or the quotes before the command sees the name.

The solution is to use a longer path to the file - the easiest being a relative path to the same directory. 

If the filename is '--myfile' you will get an error like this:

$ ls --myfile
ls: illegal option -- -
usage: ls [-ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1] [file ...]

But this works just fine:

$ ls ./--myfile
./--myfile
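
The same trick works for rm. Here is a minimal sketch, run in a scratch directory so that nothing real is touched (the variable name scratch is arbitrary). It also shows a second approach that is not covered above: most standard utilities treat a bare '--' as the end of the options, so anything after it is taken as a filename.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Create the awkward file; a redirection target is never parsed as an option
: > --myfile

# 1. Relative path, as with ls above
rm ./--myfile

# 2. End-of-options marker: everything after -- is treated as a filename
: > --myfile
rm -- --myfile
```

The relative path form has the advantage of working even with commands that do not recognise '--'.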


Plotting a simple bar plot in R

Here is my cheat sheet for loading data from a CSV file into the R statistics package, plotting one column of data as a bar plot and saving it as a PNG image.

My input file is a simple CSV file with two columns:
Position,Entropy
1,0.2237
2,0.4051
3,0.1312
4,0.1312
[...]

I want to load this into R as a data frame, then plot the values in the second column (Entropy) as a bar plot, using the values in the first column as the labels for the bars.

The first step is to use read.csv (a shortcut version of read.table)
> d <- read.csv('<your path>/myfile.csv')

I can use barplot directly on the Entropy column of the data frame (d)
> barplot(d[,'Entropy'])

But the default plot options are not great, so I can add custom options to the call, such as a main title and labels for X and Y axes. I set the lower and upper limits for the Y axis to be 0.0 and 1.0 and use the values in the first column of the data frame as the labels for the bars on the X axis
> barplot(d[,'Entropy'], main="Entropy Plot", xlab="Position",
  ylab="Entropy", ylim=c(0.0,1.0), names.arg=d[,'Position'])

The plot is displayed on my screen and looks the way I want it. To save it out to an image file, I specify the plotting device ('png') and the output filename, repeat the plot and then close/detach the plotting device.
> png('<your path>/myfile.png')
> barplot(d[,'Entropy'], main="Entropy Plot", xlab="Position",
  ylab="Entropy", ylim=c(0.0,1.0), names.arg=d[,'Position'])
> dev.off()

This produces the following image:

There are endless configuration options to play with but this works for a quick, simple plot.

Here are the steps without the prompts for you to cut and paste as needed:

d <- read.csv('<your path>/myfile.csv')
png('<your path>/myfile.png')
barplot(d[,'Entropy'], main="Entropy Plot", xlab="Position", ylab="Entropy", ylim=c(0.0,1.0), names.arg=d[,'Position'])
dev.off()

You could put these into a text file and run it from your system command line like this:

$ R CMD BATCH myfile.R

R is an incredibly useful system but as an occasional user I find the syntax and command names/options hard to learn. Hopefully this simple example helps you with the learning curve.

... and always remember - arrays in R start at 1, not 0 ...

