A collection of computer systems and programming tips that you may find useful.
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Friday, November 12, 2010

Extracting Text from PDF documents on Mac OS X

There are various ways to extract the text from a PDF document. In Mac OS X there are two effective ways that cost nothing.

1. Open the PDF in Preview or Adobe Acrobat Reader, select all the text, then copy and paste it into your editor of choice.

This is a quick and easy solution for one or a few documents. But for a large number of documents it would quickly become tedious.

Neither application allows you to Save As Text from the menu, which is unfortunate.

The two applications also produce significantly different pasted text when the original PDF has columns of text. Compare the same original text copied from Acrobat Reader:
MOLECULAR FORMULA C6492H10060N1724O2028S42
TRADEMARK None as yet
and Mac OS X Preview:
C6492H10060N1724O2028S42 146.0 kDa None as yet MedImmune
MEDI-563 1044511-01-4
The first version is the right one. So copying from Acrobat Reader is the better solution.

2. To handle many documents you want a script of some sort that just extracts the text. You can do this with the Mac OS X Automator application. I'd not used this before but it gives you access to a whole load of functions/services. Well worth checking out. You'll find Automator in your applications folder. Open it up and...

- Select Application in the template window that appears on opening.
- In the Automator window you have two vertical panels on the left and a grey workspace on the right. Click the 'Actions' button in the top left.
- In the left hand column, under Library, click 'PDFs' to bring up a list of PDF related actions in the second column.
- Select, drag and drop 'Extract PDF Text' onto the workspace. A dialog will appear.
- Select the appropriate options - plain text or RTF, folder in which to output the text, etc.
- Save As... and give your new app a suitable name
- Quit Automator
- Drag and drop a PDF onto the icon for the app and it will extract the text and output that in a file in the directory you configured in the app

Simple! Now you can just drag and drop PDFs and they will be converted automatically.

Automator can do a whole lot more than this, especially if you have a series of steps that must be applied to files (e.g. image processing). If you are into scripting, take a look at the way this all works. Right click on the app icon and 'Show Package Contents'.

You can also learn about this via this article by Mathias Bynens on a simple way to convert UNIX scripts into apps ... it's the same mechanism.

So now I've got an automated way to extract text... BUT... the script uses Preview for the extraction, and that still gives me the incorrect ordering of text that I showed in the above example... dang it... not a deal breaker but frustrating all the same


Javascript Date.parse browser compatibility issue

Just got burned with a Javascript incompatibility between Firefox and Safari on the Mac...

Date.parse() takes a string representation of a Date and returns the number of milliseconds since the epoch. It has always been able to take dates in IETF format, such as "Jan 1, 2010", but as of JavaScript 1.8.5 it can handle ISO8601 format as well, such as "2010-01-01".

I use ISO8601 in all my applications and in Firefox 3.6.12 (on the Mac) this works fine:
> Date.parse("Jan 1, 2010 GMT");
> Date.parse("2010-01-01");
Note that the ISO8601 form assumes the GMT timezone, but the IETF form assumes local timezone unless you specify GMT.

The problem for me arose in Safari (5.0.2) on the Mac:
> Date.parse("Jan 1, 2010 GMT");

And just to muddy the waters further, here is the output on Google Chrome 7.0.517.44:
> Date.parse("Jan 1, 2010 GMT");
The ISO8601 date is parsed OK, but it is interpreted in the local timezone rather than GMT.
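A quick way to see which of these behaviours a given browser has is to compare what Date.parse() returns for an ISO8601 date with midnight UTC and midnight local time for the same day. This is only a rough sketch, and the function name isoDateParseBehaviour is made up for this example.

```javascript
// Hypothetical helper (not a built-in): reports how this JavaScript engine
// treats a date-only ISO8601 string passed to Date.parse().
function isoDateParseBehaviour() {
  var parsed = Date.parse('2010-01-01');
  if (isNaN(parsed)) {
    return 'unsupported';   // the string was not understood at all
  }
  if (parsed === Date.UTC(2010, 0, 1)) {
    return 'utc';           // interpreted as midnight UTC/GMT
  }
  if (parsed === new Date(2010, 0, 1).getTime()) {
    return 'local';         // interpreted as midnight in the local timezone
  }
  return 'other';
}

console.log(isoDateParseBehaviour());
```

If the machine's local timezone happens to be UTC, the 'utc' and 'local' cases cannot be told apart and the function reports 'utc'.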

I need a fail safe way to parse ISO8601 dates - what to do...? Here's what I came up with:
var date_str = '2010-01-01';
// Capture year, month and day; the separator can be '-' or '/'
var iso8601_regex = /(\d{4})[\/-](\d{2})[\/-](\d{2})/;
var match = iso8601_regex.exec(date_str);
// Midnight in the browser's local timezone
var date = new Date(match[1], match[2] - 1, match[3]);
var milliseconds = date.getTime(); // -> 1262332800000 (local time, UTC-8 in my case)
// Midnight UTC, whatever the local timezone is
milliseconds = Date.UTC(match[1], match[2] - 1, match[3], 0, 0, 0); // -> 1262304000000 (UTC/GMT)
Note the '- 1' with the month (match[2]) - the first month is 0 - go figure.
This produces the same results in Mac Firefox, Safari and Chrome - consistent behaviour - yes, really!
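The same idea can be wrapped up in a small function so it can be reused. This is just a sketch: the name parseIsoDateUtc is my own invention, and unlike the snippet above it anchors the regex so that anything other than a bare date is rejected, which is an assumption about what you want.

```javascript
// Hypothetical helper: parse 'YYYY-MM-DD' (or 'YYYY/MM/DD') into milliseconds
// since the epoch at midnight UTC, without relying on Date.parse().
function parseIsoDateUtc(dateStr) {
  var match = /^(\d{4})[\/-](\d{2})[\/-](\d{2})$/.exec(dateStr);
  if (match === null) {
    return NaN; // mirror Date.parse(), which returns NaN for bad input
  }
  var year = Number(match[1]);
  var month = Number(match[2]) - 1; // JavaScript months are zero-based
  var day = Number(match[3]);
  return Date.UTC(year, month, day);
}

console.log(parseIsoDateUtc('2010-01-01')); // -> 1262304000000
console.log(parseIsoDateUtc('not a date')); // -> NaN
```

Note that this does not check that the month and day are in range, so something like '2010-13-45' will quietly roll over into a later date.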

What a palaver...


Tuesday, November 9, 2010

Rails has_many :through associations and edit/update actions

Consider a basic has_many :through association
class Drug < ActiveRecord::Base
  has_many :indications
  has_many :diseases, :through => :indications, :uniq => true
end
Here a drug can be used to treat multiple diseases and any disease can be treated with multiple drugs (that side of the association is not shown). The 'indications' table is a basic linking table with drug_id and disease_id columns.

In the drug#new and drug#edit forms you might use a select menu that allows you to select multiple diseases. I use the 'simple_form' gem for my forms and the 'association' method makes this trivial.
<%= f.association :diseases,
:collection => Disease.all(:order => 'name') %>
In order for this to work you need to add :disease_ids to the attr_accessible list in your model. With that, simple_form should handle all the details needed to create and update the association.

But there is a problem with the edit/update actions when you want to deselect ALL diseases. If you do this in the form then no disease_ids parameter will get passed to your controller and so the association will not get updated. It is a classic issue with HTML form updates and applies to checkboxes as well.

The solution is to add a line to the update action in your controller that sets the disease_ids parameter to an empty array if it does not exist:
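  def update
    # Deselecting every disease sends no disease_ids parameter, so default to an empty array
    params[:drug][:disease_ids] ||= []
    @drug = Drug.find(params[:id])
    # ... the rest of the update action is unchanged
  end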
This works fine - adds a bit of clutter to the controller but there you go...

However, this will break a basic functional test for the update action and you will get an error similar to this:
NoMethodError: You have a nil object when you didn't expect it!
You might have expected an instance of Array.
The error occurred while evaluating nil.[]
The problem is that the basic test does not pass a params[:drug] hash, so params[:drug] is nil in the controller. Now, I'm not an expert on stubs/mocking so there may be a much cleaner way of fixing this, but I fix it by explicitly creating the needed parameters and passing them in the 'put' method.

Here is an example of a basic update test
  def test_update_valid
    put :update, :id => Drug.first
    assert_redirected_to drug_url(assigns(:drug))
  end
and here is the modified one that will work
  def test_update_valid
    put :update, { :id => Drug.first, :drug => { :disease_ids => [] } }
    assert_redirected_to drug_url(assigns(:drug))
  end

Of course I should be building out the tests to provide truly useful tests, but if you can't get beyond this step, the others don't really matter.

Hope this helps....


Disabling spell check in HTML forms

I work with DNA and protein sequences and I often have HTML forms with a textarea for entering sequence. Unfortunately my browser sees that text (e.g. 'agctagagctcgatagc') and decides that this is misspelled and underlines all the sequence text with a red dotted line... ugly...

In HTML5 you can now disable spell checking on textarea and text inputs by setting the attribute spellcheck="false" on the element - EASY!

Note that this is an HTML attribute, NOT CSS - so you have to set it in the form itself.

Browser support for the feature may vary. It works on Firefox and Safari on the Mac for sure.
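In the markup itself this is just &lt;textarea spellcheck="false"&gt;. If you can't easily change the markup, the same attribute can be set from JavaScript once the page has loaded. Here is a minimal sketch of that; the function name disableSpellcheck is made up, and applying it to every textarea on the page is just an assumption for the example.

```javascript
// Hypothetical helper: switch off spell checking on a list of form elements.
// Works with any array-like list of objects that have setAttribute(),
// such as the result of document.querySelectorAll().
function disableSpellcheck(elements) {
  for (var i = 0; i < elements.length; i++) {
    elements[i].setAttribute('spellcheck', 'false');
  }
  return elements.length; // number of elements updated
}

// In a browser, apply it to every textarea on the page.
if (typeof document !== 'undefined') {
  disableSpellcheck(document.querySelectorAll('textarea'));
}
```

Setting the attribute directly in the form is still the simpler option where you control the HTML.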

A related attribute is 'contenteditable', which controls whether the content of an arbitrary element (a div, for example) can be edited by the user. It is not specific to textareas, so it gives you much more control than 'readonly' does.


Tuesday, November 2, 2010

Rails, Factory_girl and Bioinformatics - a gotcha

I've been using Factory Girl as a replacement for fixtures in a new Rails application. It has been working well but I just stumbled across a BIG gotcha for my application.

I work on bioinformatics applications with DNA and Protein sequences and having a model with a column called 'sequence' is a natural choice.

The problem is that Factory Girl allows you to create objects in which a specific field is given a sequential id. For example 'user_1', 'user_2', etc. You set that up in your Factory definition like this:
Factory.define :mymodel do |f|
  f.sequence(:name) {|n| "name_#{n}" }
end
When you have a model with a field called sequence you would define that like this:
Factory.define :mymodel do |f|
  f.sequence(:name) {|n| "name_#{n}" }
  f.sequence 'acgtacgtacgt'
end
You see the problem... Running a test with this in it brings up an error message like this:
  1) Error:
test: A Sequence instance should be valid. (SequenceTest):
NoMethodError: undefined method `call' for nil:NilClass
/Users/jones/.rvm/gems/ruby-1.9.2-p0/gems/activesupport-3.0.1/lib/active_support/whiny_nil.rb:48:in `method_missing'
The same problem will arise if you have a column called 'association'.

Mixing up field names and methods like this is a bad design choice. A better choice would have been to use a 'verb' such as f.generate_sequence(:name), which is much less likely to be used as a column name.

I can't see a way around this other than changing the name of the column in my model, which I am very reluctant to do.
Machinist is an alternative to Factory Girl, so that might be a solution - or hacking the factory_girl gem to change the name of the methods...

UPDATE: Factory Girl will allow alternate syntaxes - see the "Alternate Syntaxes" section. Not sure if it's the best path for me but it might get me over the immediate hurdle.

UPDATE 2: Just tried Machinist and that solves the first problem... however it fails if you have a column called 'alias' in a blueprint - the trick here is to precede the column name with 'self', i.e. 'self.alias' works.

