A collection of computer systems and programming tips that you may find useful.
 
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Friday, November 12, 2010

Extracting Text from PDF documents on Mac OS X

There are various ways to extract the text from a PDF document. In Mac OS X there are two effective ways that cost nothing.

1. Open the PDF in Preview or Adobe Acrobat Reader, select all the text, then copy and paste it into your editor of choice.

This is a quick and easy solution for one or a few documents, but for a large number of documents it quickly becomes tedious.

Neither application allows you to Save As Text from the menu, which is unfortunate.

The pasted text also differs significantly between the two applications when the original PDF has columns of text. Compare the same original text copied from Acrobat Reader:
MOLECULAR FORMULA C6492H10060N1724O2028S42
MOLECULAR WEIGHT 146.0 kDa
TRADEMARK None as yet
MANUFACTURER MedImmune
CODE DESIGNATION MEDI-563
CAS REGISTRY NUMBER 1044511-01-4
and Mac OS X Preview:
MOLECULAR FORMULA MOLECULAR WEIGHT TRADEMARK MANUFACTURER CODE DESIGNATION CAS REGISTRY NUMBER
C6492H10060N1724O2028S42 146.0 kDa None as yet MedImmune
MEDI-563 1044511-01-4
The first version is the right one. So copying from Acrobat Reader is the better solution.
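
To make the difference concrete, here is a minimal Python sketch using the lines from the example above. Preview reads the page column by column (all the labels, then all the values), whereas Acrobat Reader reads it row by row. The sketch assumes the labels and values have already been separated into two lists by hand; the names `labels`, `values` and `pair_columns` are just for illustration. Splitting Preview's flattened output automatically is a harder problem that this does not attempt.

```python
# Column-by-column order, as in the Preview output: labels first, then values.
labels = [
    "MOLECULAR FORMULA",
    "MOLECULAR WEIGHT",
    "TRADEMARK",
    "MANUFACTURER",
    "CODE DESIGNATION",
    "CAS REGISTRY NUMBER",
]
values = [
    "C6492H10060N1724O2028S42",
    "146.0 kDa",
    "None as yet",
    "MedImmune",
    "MEDI-563",
    "1044511-01-4",
]


def pair_columns(labels, values):
    """Re-pair each label with its value to recover row-by-row order."""
    if len(labels) != len(values):
        raise ValueError("label and value columns differ in length")
    return [f"{label} {value}" for label, value in zip(labels, values)]


if __name__ == "__main__":
    # Prints the rows in the same order as the Acrobat Reader version.
    print("\n".join(pair_columns(labels, values)))
```

This only works because the example has exactly one value per label. With wrapped values or empty cells the two lists no longer line up.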

2. To handle many documents you want a script of some sort that just extracts the text. You can do this with the Mac OS X Automator application. I'd not used this before but it gives you access to a whole load of functions and services. Well worth checking out. You'll find Automator in your Applications folder. Open it up and...

- Select Application in the template window that appears on opening.
- In the Automator window you have two vertical panels on the left and a grey workspace on the right. Click the 'Actions' button in the top left.
- In the left hand column, under Library, click 'PDFs' to bring up a list of PDF related actions in the second column.
- Select, drag and drop 'Extract PDF Text' onto the workspace. A dialog will appear.
- Select the appropriate options - plain text or RTF, folder in which to output the text, etc.
- Choose Save As... and give your new app a suitable name.
- Quit Automator.
- Drag and drop a PDF onto the icon for the app. It will extract the text and write it to a file in the directory you configured in the app.

Simple! Now you can just drag and drop PDFs and they will be converted automatically.
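
If you would rather script the batch step than drag and drop, the same pattern (one text file per PDF, written to a folder you choose) can be sketched in Python. This is only an outline: `batch_extract` is a name made up for this example, and `extract_text` is a hypothetical function that you supply, since Python's standard library has no PDF text extractor of its own.

```python
from pathlib import Path
from typing import Callable, List


def batch_extract(pdf_dir: Path, out_dir: Path,
                  extract_text: Callable[[Path], str]) -> List[Path]:
    """Write one .txt file into out_dir for every *.pdf file in pdf_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Only looks at the top level of pdf_dir, and only at names ending in .pdf
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        text = extract_text(pdf)  # hypothetical extractor supplied by the caller
        target = out_dir / (pdf.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

You would call it as `batch_extract(Path("pdfs"), Path("text"), my_extractor)`, where `my_extractor` wraps whatever PDF library or command line tool you have installed.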

Automator can do a whole lot more than this, especially if you have a series of steps that must be applied to files (e.g. image processing). If you are into scripting, take a look at the way this all works. Right click on the app icon and select 'Show Package Contents'.

You can also learn about this via this article by Mathias Bynens on a simple way to convert UNIX scripts into apps ... it's the same mechanism.
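
As a rough illustration of that mechanism, here is a Python sketch that wraps a script in a bare-bones app bundle. It assumes the minimal layout, a folder named Name.app holding an executable at Contents/MacOS/Name. The function name `appify` and its parameters are chosen for this example. An app saved from Automator contains more than this, so treat it as a sketch of the idea, not a recreation of what Automator writes.

```python
import shutil
import stat
from pathlib import Path


def appify(script: Path, app_name: str, dest_dir: Path) -> Path:
    """Copy an executable script into a minimal Name.app bundle folder."""
    macos_dir = dest_dir / f"{app_name}.app" / "Contents" / "MacOS"
    # Raises FileExistsError rather than overwriting an existing bundle
    macos_dir.mkdir(parents=True, exist_ok=False)
    target = macos_dir / app_name
    shutil.copy(script, target)
    # Mark the copy executable for user, group and others (like chmod +x)
    mode = target.stat().st_mode
    target.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return target
```

The script itself needs a valid first line such as `#!/bin/sh` so that the system knows how to run it.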

So now I've got an automated way to extract text... BUT... the script uses Preview for the extraction, and that still gives me the incorrect ordering of text that I showed in the example above... dang it... not a deal breaker, but frustrating all the same.


 

3 comments:

BakingWonderland said...

awesome tip

Unknown said...
This comment has been removed by the author.
Unknown said...

Hi,

All I get is a blank txt file.

thanks,

Andrew
