A collection of computer systems and programming tips that you may find useful.
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Thursday, November 17, 2011

strsplit in R

The strsplit function in the R statistics package splits a string into a list of substrings based on the separator, just like split in Perl or Ruby. The object returned is a List, one of the core R object types. For example:
> a <- strsplit("x y z", ' ')
> a
[1] "x" "y" "z"
> class(a)
[1] "list"

If you are not that familiar with R, like me, the obvious way to access an element in the list will not work:
> a[1]
[1] "x" "y" "z"
> a[2]

So what do you do? There seem to be two options:

1: You can 'dereference' the element (for want of a better word) by using the multiple sets of brackets
> a[[1]][1]
[1] "x"
> a[[1]][2]
[1] "y"
... but I'm not going to write code that looks like that !!

2: You can unlist the List to create a Vector and then access elements directly
> b < unlist(a)
> b <- unlist(a)
> b
[1] "x" "y" "z"
> class(b)
[1] "character"
> b[1]
[1] "x"
> b[2]
[1] "y"
Much nicer !

Wednesday, November 16, 2011

Running R scripts on the Command Line

Using the R statistics package via a GUI is great for one off tasks or for learning the language, but for repeated tasks I want the ability to create and run scripts from the UNIX command line.

There are several ways to do this:

R CMD BATCH executes R code in a script file with output being sent to a file.
$ R CMD BATCH myscript.R myoutputfile
If no output file is given then the output goes to myscript.Rout. There is no way that I know of to have it go to STDOUT. Passing parameters to your script with this approach is a pain. Here is an example script:
args <- commandArgs(TRUE)
for (i in 1:length(args)){

This is invoked with this command:
$ R CMD BATCH  --no-save --no-restore --slave --no-timing "--args foo=1  bar='/my/path/filename'" myscript.R tmp
In particular, note the strange quoting of the arguments, preceded by --args inside the outer quotes - it's not a typo!

That command produces this output in file 'tmp':
[1] "foo=1"
[1] "bar='/my/path/filename'"
All those '--' options are necessary ! Try leaving out --slave and --no-timing and you'll see why.

Thankfully there is a better option ...

Rscript is an executable that is part of the standard installation

You can add a 'shebang' line to the file with your R script, invoking Rscript, make the file executable and run it directly, just like any other Perl, Python or Ruby script.

You don't need those extra options as they are the default for Rscript, and you pass command line options directly without any of that quoting nonsense.

Here is an example script:
#!/usr/bin/env Rscript
args <- commandArgs(TRUE)
for (i in 1:length(args)){
Running this with arguments:
$ ./myscript.R foo bar
produces this output on STDOUT (which you can then redirect as you choose)
[1] "foo"
[1] "bar"
Much nicer - but we've still got those numeric prefixes. If you are passing the output to another progran these are a major pain.

The way to avoid those is to use cat() instead of print() - BUT you need to explicitly include the newline character as a separate argument to the cat() function
#!/usr/bin/env Rscript
args <- commandArgs(TRUE)
for (i in 1:length(args)){
  cat(args[i], "\n")
results in this output:

$ ./myscript.R foo bar

For the sake of completeness, you can run R scripts with a shebang line that invokes R directly. But Rscript seems to be the best solution.

If you want to pass command line arguments as attribute pairs then you need to parse them out within your script. I haven't got this working in a general sense yet. What I want is to pass arguments like this:
$ ./myscript.R infile="foo" outfile='bar'
But I'm not quite there yet...

Deleting a File that starts with '-' on UNIX

Filenames that begin with 'special' characters like '-', '.' or '*' cause problems on Unix. Standard commands like ls or rm view the characters as signifying command options.

You don't typically create files with names like this but they can arise through errors such as cut and pasting text into your command line.

Simply escaping the character or quoting the filename does not work.

The solution is to use a longer path to the file - the easiest being a relative path to the same directory. 

If the filename is '--myfile' you will get an error like this:

$ ls --myfile
ls: illegal option -- -
usage: ls [-ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1] [file ...]

But this works just fine:

$ ls ./--myfile

Plotting a simple bar plot in R

Here is my cheat sheet for loading data from a CSV file into the R statistics package, plotting one column of data as a bar plot and saving it as a PNG image.

My input file is simple CSV file with two columns :

I want to load this into R as a data frame, then plot the values in the second column (Entropy) as a bar plot, using the values in the first column as the labels for the bars.

First step is to use read.csv (a shortcut version of read.table)
> d <- read.csv('<your path>/myfile.csv')

I can use barplot directly on the data frame (d)
> barplot(d[,'Entropy'])

But the default plot options are not great, so I can add custom options to the call, such as a main title and labels for X and Y axes. I set the lower and upper limits for the Y axis to be 0.0 and 1.0 and use the values in the first column of the data frame as the labels for the bars on the X axis
> barplot(d[,'Entropy'], main="Entropy Plot", xlab="Position",
  ylab="Entropy", ylim=c(0.0,1.0), names.arg=d[,'Position'])

The plot is displayed on my screen and looks the way I want it. To save it out to an image file, I specify the plotting device ('png') and the output filename, repeat the plot and then close/detach the plotting device.
> png('<your path>/myfile.png')
> barplot(d[,'Entropy'], main="Entropy Plot", xlab="Position",
  ylab="Entropy", ylim=c(0.0,1.0), names.arg=d[,'Position'])
> dev.off()

This produces the following image:

There are endless configuration options to play with but this works for a quick, simple plot.

Here are the steps without the prompts for you to cut and paste as needed:

d <- read.csv('<your path>/myfile.csv')
png('<your path>/myfile.png')
barplot(d[,'Entropy'], main="Entropy Plot", xlab="Position", ylab="Entropy", ylim=c(0.0,1.0), names.arg=d[,'Position'])

You could put these into a text file and run it from your system command line like this:

$ R CMD BATCH myfile.R

R is an incredibly useful system but as an occasional user I find the syntax and command names/options  hard to learn. Hopefully this simple example helps you with the learning curve.

... and always remember - arrays in R start at 1, not 0 ...

Monday, November 14, 2011

Captain Beefheart Song Titles

Here is a silly project that I knocked out last week - a generator of Fake Captain Beefheart Song Titles

It was inspired by a comment on Gideon Coe's radio show on BBC 6 Music where he wondered if such a thing existed - it didn't - so I wrote one!

The 'algorithm', if you can call it that, combines words from real Beefheart song titles and a list of others that sound (to me) like they could be. The structure of the real titles is relatively simple with most of them following a few simple patterns. Some of the generated titles don't work but every so often it'll spit out a good one.

The site is built with Ruby and Sinatra, with some jQuery thrown in for the scrolling effect. It serves as a nice example if you are learning Sinatra. You can find the code at Github and the application is hosted on Heroku.

It's just a bit of fun, but writing a small, self-contained application like this is a great way to learn new technologies. Giving yourself a short time frame in which to develop and deploy a complete application is a great exercise in efficiency.

Friday, November 11, 2011

When jQuery $.ajax() keeps returning the same value on Internet Explorer

I just mentioned this in my last post, but it is important enough that I'm going to give it its own post!

If you have jQuery Ajax calls, such as $.ajax() , $.get(), etc., that are continually fetching data from a remote server, you may find this works on Firefox, Chrome, etc. but in Internet Explorer it keeps returning the same value.

IE has cached the first return value and thinks that you are making identical requests so it just uses the cached value and never contacts the server.

Turn off caching by adding a call to ajaxSetup  at the start of your script block and set cache to false.


  $.ajaxSetup ({  
     cache: false  

The fix is simple once you realize it but it took me a while to figure it out this morning. 

Debugging JavaScript is painful at the best of times. It's worse when you throw AJAX into the mix. So take small steps and test on multiple platforms right from the start. I need to learn that lesson.

Updating a DIV with jQuery and Sinatra

Here is a minimal example of how to update a DIV in a web page with content from a remote Sinatra server using jQuery AJAX.

On the Server end you want an action that generates some new content and just returns it as a string. Let's just get the current time:

get '/update' do

In the client web page you need to include the jQuery library:
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7/jquery.min.js"></script>

You need a named DIV that will receive the content

<div id='update-div'></div>

Then you need to create your script block. You need a function that updates the DIV. update_time() calls the jQuery get function with two arguments. The first is the URL of the server action. The second is a function that is called on the returned data and the third is a type of data that the server will return, which is 'text' in our case.

Here we update the named DIV with the string returned from the server. Then we use the setTimeout function to wait 5 seconds, and then we call the function itself to repeat the process. That instance fetches content from the server, waits 5 seconds, calls itself, and so on.

At the start of the script block we add a call to ajaxSetup and set cache to false. This seems to be necessary for Internet Explorer which would otherwise cache the first value returned and keep using that instead of fetching new values.

Finally we make the initial call to the function when the document is first loaded and ready.

  $.ajaxSetup ({  
     cache: false  

  $(document).ready(function() { update_time(); });

  // fetch text from the server, wait 5 secs and repeat
  function update_time() {
          function(data) {
            window.setTimeout(update_time, 5000);

The jQuery get function is a simplified wrapper around the ajax call.

That is all you need to have a DIV continually updated with remote content, in this case coming from Sinatra.

Latest version of jQuery on Google APIs

If a web page uses the jQuery library it usually makes sense to link to copy hosted on an external content delivery network (CDN). You don't have to keep an updated copy on your site, the client may already have this copy cached, and the CDN has high bandwidth connections. Google, Microsoft and jQuery themselves offer CDN hosted jQuery.

I use the Google version and call it with this line:

 <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.0/jquery.min.js"></script>

That is fine, but what happens when a new version comes out? 

In these cases some sites will offer a 'latest' url that automatically points to the most recent version.

Google does not do that - but they offer something similar, and which might be slightly better.

If I want version 1.7.0 and nothing else then I use the link given above. 
If I want the latest version in the 1.7 series then I use:

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7/jquery.min.js"></script>

And if I want the latest version in the 1 series then I use:

 <script src="https://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js"></script>

Note that you don't always want to use the most general form. New releases are not necessarily back compatible. But you have the choice. 
And note that you no longer need to include type="text/javascript" in the script tag anymore - it's the default.

Thursday, November 10, 2011

Sorting on multiple String keys in Ruby

Ruby gives you two ways to sort arrays of 'complex' objects - sort and sort_by

Consider this array of hashes

array = [{:key0 => 'foo', :key1 => 'bar'},{:key0 => 'hot', :key1 => 'cold'}, ... ]

sort_by is the most compact form. To sort on :key0 in ascending order you write:

array.sort_by{ |a| a[:key0] }

sort is a little more verbose, but offers more flexibility:

array.sort{ |a, b| a[:key0] <=> b[:key0] }

If you want to sort strings in descending order then you switch the a and b assignments:

array.sort{ |a, b| b[:key0] <=> a[:key0] }

You can use sort_by with multiple keys if you include them in an array:

array.sort_by{ |a| [ a[:key0], a[:key1] }

But if you are using strings and you want one of the keys sorted in descending order then you need to use sort.

In this example the desired ordering is to first sort on :key0 in descending order and then on :key1 in ascending order:

array.sort do |a,b|
  (b[:key0] <=> a[:key0]).nonzero? ||
  (a[:key1] <=> b[:key1])

Not the most elegant or concise piece of code, but it does the job.

Wednesday, November 9, 2011

Manipulating Model/Table/Controller Names in Rails

There are some very useful methods that operate on the names of your models, tables and controllers in Rails. I use them infrequently enough that I have to relearn them every time - so I'm putting them in here:

Given a string with the name of a model (e.g. Project), how can I create an ActiveRecord call on the fly using that model?

str = 'Project'
project = (str.constantize).find(1)

How can I get the name of the corresponding controller ?

controller = str.tableize

Given a string with the name of a controller, how can I get the name of the Class/Model ?

str = 'sales_contact'
model_name = str.camelize

I the string is plural:

str = 'sales_contacts'
model_name = str.classify

How do I do the reverse operation?

str = 'SalesContact'
name = str.underscore

Parsing 96 / 384 well Plate Maps

In molecular biology and related fields, we use rectangular plastic plates with 96 or 384 wells for many purposes, allowing us to run many experiments in parallel. In particular, they are the standard format for tasks involving laboratory robotics, such as fluid dispensing, DNA sequencing, mass spectrometry, etc.

In setting up a plate of samples, it is common for a researcher to use Excel to create a 'plate map' - a table of 8 rows and 12 columns, for a 96 well plate, with each cell holding information on the sample in that well.

Converting this into a list of individual wells for input into an instrument or a piece of software should be straightforward but variability in manually created plate maps can be an issue. Extraneous text and the placement of the table in the spreadsheet are just two of them.

I have to deal with this problem every year or two and so I decided to write a utility script that implements this conversion and offers several options for the output file.

It is not intended as a complete solution for any one application. Rather it is a starting point that I can customize as needed.

I know other people have to deal with the same type of data, so I've made the script as general as possible and put it up on github under the MIT license. Please adapt this as you see fit for your needs.

The script can take CSV or Tab delimited files as input, containing 96 or 384 well plate maps. The output is a table with one well per row and the columns : plate, row, column, value. Options allow you to specify CSV or Tab delimited as the output format, whether to parse the input in row or column major order, and whether or not to output empty cells in the plate map.

The code is written in Ruby and you can find it at https://github.com/craic/plate_maps

Let me know if you find it useful

Archive of Tips