Thursday, July 19, 2012

Setting up Apache Solr on Mac OS X

Apache Solr is an incredibly flexible and capable text search engine that you can hook into other applications.

For a beginner it can appear daunting because of the number of configuration options, and reading the documentation tends not to improve that feeling. But if you are using a recent version of Mac OS X and have the homebrew package manager installed, then the process is not too bad, at least to get you up and running.

These steps will help you get Solr installed and then load in data extracted from HTML, PDF and Word files.

In my configuration I have Mac OS X Lion and current versions of Java and homebrew.

1: Install Solr with homebrew - (brew update is always a good idea before a complex install)
$ brew update
$ brew install solr

This installs the software in /usr/local/Cellar/solr/3.6.0 (or whatever your version is)

2: Go to the example directory and start up Solr
$ cd /usr/local/Cellar/solr/3.6.0/libexec/example
$ java -jar start.jar
This will start up the Solr server using Jetty as the servlet container, which is just fine for our testing. Do not worry about using Tomcat, etc. until you are comfortable with a working Solr setup.
It is usually not good form to work inside a distribution directory - but for initial testing you should do this.
You will see a load of verbose output from Java in your terminal window. Apart from any real errors, just ignore it.

3: Verify that the server is running
Browse to http://localhost:8983/solr/admin
You should see the Administration interface with a gray background. There is not a lot you can do as you have not yet indexed any data, but this shows that the server is running.

4: Load some data into Solr
$ cd exampledocs
$ java -jar post.jar *.xml
Note that the Solr server MUST be running before you try to load the documents.

5: Run your first query from the admin page
Enter 'video' in the Query form and hit 'Search'. You should see the contents of an XML file returned to you with 3 documents.
This demonstrates that the server is working, and that you can index XML documents and query them. Solr does not provide you with a nice search interface. The intent is that your application (Rails, etc.) sends queries and then parses the XML results for display back to the user.
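As a rough illustration of that last point, here is a minimal Ruby sketch of how an application might send a query and pull out the matching documents. It assumes the example server from step 2 on its default port and the standard /solr/select handler. To keep the parsing short it asks for JSON with wt=json rather than the default XML. The method names (solr_query_uri, parse_solr_docs, search) are my own, not part of Solr.

```ruby
require "json"
require "net/http"
require "uri"

# Assumption: the example server from step 2, on its default port.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Build the query URL. wt=json asks Solr for JSON instead of the default XML.
def solr_query_uri(query, rows: 10)
  uri = URI(SOLR_SELECT_URL)
  uri.query = URI.encode_www_form(q: query, rows: rows, wt: "json")
  uri
end

# Pull the list of matching documents out of a Solr JSON response body.
def parse_solr_docs(body)
  JSON.parse(body).fetch("response").fetch("docs")
end

# Send the query to a running server and return the matching documents.
def search(query, rows: 10)
  parse_solr_docs(Net::HTTP.get(solr_query_uri(query, rows: rows)))
end

# Usage, with the server running:
#   search("video").each { |doc| puts doc["id"] }
```

Each document comes back as a plain Ruby hash, so the application can pick out whichever fields it wants to display.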

6: Extract text from other document types
What I want to use Solr for is to search text extracted from web pages, Word documents, etc. But this is where the Solr documentation started to really let me down.
This parsing is done by calling Apache Tika which is a complex software package with the sole aim of text extraction. The interface between Solr and Tika used to be called SolrCell and is now called the ExtractingRequestHandler. Don't worry about any of that for now! The Solr distribution has everything you need already in place - you just need to know how to use it...


Do not set up a custom Solr home directory for now. That is what the documentation suggests, and it makes perfect sense, but nothing will work if you do.


Save a few HTML files that contain a good amount of text into a temporary directory somewhere.

To load the contents of a file into the server and have it parse the text, you need to use 'curl' to POST the data to a specific URL on the server. In this example my file is called index.html and I have cd'ed to the directory containing the file:
$ curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&uprefix=attr_&fmap.content=attr_content" -F "myfile=@index.html"

There are several things to note with this URL...
The server URI is http://localhost:8983/solr/update/extract
The parameters are
literal.id=doc1
commit=true
uprefix=attr_
fmap.content=attr_content

literal.id=doc1  provides a unique identifier for this document in the index (doc1 in this case)
uprefix=attr_  adds the prefix 'attr_' to the name of every extracted field that is not defined in the schema
fmap.content=attr_content  specifies that the main text content of the page should be stored in the field attr_content
commit=true  actually commits the parsed data to the index

And then note how the file to be parsed is specified... as a quoted string passed with the -F flag to curl. The quoted string consists of a name for the uploaded file (myfile), an equals sign, and then the path to the file preceded by the at sign (@).

You can ignore the details for now, with one exception: each document you load must have a unique document ID.
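If you have more than a couple of files to load, building that URL by hand gets tedious. Here is a small Ruby sketch that assembles the same URL and curl arguments for a given document ID and file. The method names (extract_url, curl_command) are hypothetical helpers of mine. The parameters are exactly the ones described above.

```ruby
require "uri"

# Handler URL from the curl example above.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Build the full URL for one document. doc_id must be unique per document.
def extract_url(doc_id)
  params = {
    "literal.id"   => doc_id,
    "commit"       => "true",
    "uprefix"      => "attr_",
    "fmap.content" => "attr_content"
  }
  "#{EXTRACT_URL}?#{URI.encode_www_form(params)}"
end

# Return the curl command as an argument array, suitable for Kernel#system.
def curl_command(doc_id, path)
  ["curl", extract_url(doc_id), "-F", "myfile=@#{path}"]
end

# Usage, with the server running:
#   system(*curl_command("doc1", "index.html"))
```

Passing the arguments to system as an array avoids any shell quoting problems with the & characters in the URL.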

7: Verify that the text has been indexed
Go to the Admin interface and click 'Full Interface'. Set the number of rows returned to a large number (50) and then search with the query *:*   - this will return all the records in the index.

You should get back an XML page with the example data plus the text derived from your HTML file at the bottom.

8: Try loading other data types
Tika knows how to parse Word, Excel, PDF and other files.
Be aware that there may not be much extractable text in certain files. This is especially true of PDF files, which may appear to contain text when displayed but actually have that text stored as an image.

9: Explore the example directory

Everything I've shown here should 'just work' as long as you worked in the example directory. In there you will see a 'solr' subdirectory which contains conf and data directories. The data directory is where the index files reside. In the conf directory you will see a solrconfig.xml file. This is the main location for specifying the various components that your Solr installation uses. The default values happen to work fine until you create your own 'Solr Home' directory elsewhere.

The problem is that the solrconfig.xml file contains various relative paths (like ../../lib) which will not work if you create a solr directory in an arbitrary location. Knowing this you can update those paths pretty easily, but if you created a custom solr home then you are in for a lot of frustration trying to figure out why nothing works. I went through that process - you shouldn't have to...

That should be enough to get you started with Solr. The next steps are to link it to your web application, to load in a lot more data and to move it to a production servlet container like Tomcat.

Wednesday, July 18, 2012

Installing Apache Tomcat on Mac OS X Lion using homebrew

It is hard to overstate how grateful I am for the homebrew Mac OS X package manager.

Today I needed to set up Apache Tomcat on an OS X Lion machine. I already have homebrew installed and have used it to install a bunch of stuff.

All that is needed is:
$ brew install tomcat

I'll admit that this failed the first time that I tried it, but running 'brew update' and then 'brew install tomcat' sorted that out - I guess if you've not used it for a while you should run 'brew update' as it is still evolving quite rapidly.

That installs the code into /usr/local/Cellar/tomcat/7.0.28

To start/stop the server from the command line you use the catalina shell script
$ /usr/local/Cellar/tomcat/7.0.28/bin/catalina run
[...]
$ /usr/local/Cellar/tomcat/7.0.28/bin/catalina stop


When it is running you can go to http://localhost:8080 and see a default Tomcat information page with lots more info.


For me that default port is a problem as it is also the default for my nginx installation. But changing it is easy enough. Edit /usr/local/Cellar/tomcat/7.0.28/libexec/conf/server.xml and replace the instances of 8080 with your preferred port. Rerun catalina and then go to your preferred URL.
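If you would rather script that edit, something like the following Ruby sketch would do it. It assumes the port appears in server.xml as a port="8080" attribute, which is how the stock file defines its HTTP connector. The replacement port 8090 and the method name change_port are just examples of mine.

```ruby
# Replace the connector port in the text of a server.xml file.
# Only port="..." attributes with the old value are changed, so other
# ports (shutdown, AJP, redirectPort) are left alone.
def change_port(xml, old_port: 8080, new_port: 8090)
  xml.gsub(%(port="#{old_port}"), %(port="#{new_port}"))
end

# Usage (back up the file first):
#   path = "/usr/local/Cellar/tomcat/7.0.28/libexec/conf/server.xml"
#   File.write(path, change_port(File.read(path)))
```

Note that this leaves any mention of the old port inside comments untouched, which is harmless.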


For my purposes I want to start and stop the server manually, but check the documentation for how to link it to an instance of Apache.



Friday, July 6, 2012

Downloading CSV files in Rails

Providing a way to download data as a CSV file is a common feature in Rails applications.

There is a nice Railscasts episode on the topic here:  Exporting CSV and Excel

I prefer to use a view template to generate my CSV as it gives me a lot of control over the fields that go into the file. But the standard way of invoking this from the controller does not provide a way to specify the filename for the downloaded file:


  respond_to do |format|
    format.html
    format.csv 
  end 

With a Show action, for example '/posts/25.csv', the downloaded file would be called '25.csv' which is not useful.

In a response to this Stackoverflow question, Clinton R. Nixon offers up a nice solution.

He has a method called render_csv in his application controller that takes an optional filename. Before calling regular render on your template, it sets several HTTP headers - most importantly a Content-Disposition header with the desired filename. It adds the '.csv' suffix for you and uses the action name as the default if no name is supplied.

With this in place you modify your controller like this:


  respond_to do |format|
    format.html
    format.csv { render_csv('myfile') }
  end
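To give a feel for what such a helper does, here is a stripped-down sketch of my own. It is not Clinton's code (his version sets more headers, so look at the Stack Overflow answer for the real thing). I have written it as a module, hypothetically named CsvRendering, that you would include in ApplicationController. It relies on the controller's own action_name, headers and render methods.

```ruby
# Simplified sketch of a render_csv helper, meant to be mixed into
# ApplicationController. It relies on the controller's own
# action_name, headers and render methods.
module CsvRendering
  def render_csv(filename = nil)
    filename ||= action_name        # default to the name of the action
    filename = "#{filename}.csv"    # add the suffix for the caller

    # Tell the browser this is a CSV download with the given filename.
    headers["Content-Type"] = "text/csv"
    headers["Content-Disposition"] = %(attachment; filename="#{filename}")

    # Render the usual view template for this action, without a layout.
    render layout: false
  end
end
```

The Content-Disposition header is the important part, as that is what sets the name of the downloaded file.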

Very nice and very useful...


Using Sass in Rails 3.0

Sass is the default CSS preprocessor in Rails 3.1 and it is tied into the asset pipeline machinery that first appears in 3.1.

I am migrating a complex app from 3.0 to 3.1 and I want to use Sass in the 3.0 production version before I make the big jump to 3.1. So far I have been using Less and the more plugin.

I recently wrote up my experience moving from Less to Sass syntax but here I want to cover the Rails side of things.

Here are the steps involved in my Less to Sass transition in Rails 3.0.5

Part 1: Remove all traces of Less

1: Remove the less gem from your Gemfile
2: Rename the app/stylesheets folder to something else
3: Move or delete the Less plugin in vendor/plugins/more so that the app doesn't see it on start up

Part 2: Set up Sass

4: Add the sass gem to your Gemfile ( gem 'sass' )
5: Run 'bundle install'
6: Create a sass directory under public/stylesheets
7: Leave any plain CSS files in public/stylesheets
8: Place any SCSS files in the sass directory - these could be new files or converted Less files - give them the .scss suffix
9: Remove or rename any matching CSS files in the parent directory
10: Restart your server and browse to your site

All being well Sass has worked in the background, created new versions of the CSS files and put them in public/stylesheets. Look for those files and test out the operation by, say, changing a background color in the SCSS file, saving the file and reloading the page. You should see the change implemented.

There are various options that you can set in your environment.rb file or similar. But you don't need any of these. Likewise you don't need to run any rake tasks or explicitly setup sass to watch certain files. It just works.

With this set up you can now get comfortable with Sass such that moving to 3.1 and the Asset Pipeline should be straightforward.

Thursday, July 5, 2012

Histograms in Mac OS X Numbers

I am always creating histograms based on CSV files where the first column is the category and the second is the value - like this:

year,count
2008,2
2009,10
2010,3
2011,6
2012,21

I want to load this into Excel or Numbers and create a bar chart where each bar is labeled with the year on the X axis and the height of the bar is the value for that category.

Load in the CSV file and you will have two columns of data.

The trick with Numbers is to specify the first column (year) as a Header column.

Go to the first column header (A) and use the pull down menu to select 'Convert to Header Column'. That makes the background for these cells gray and makes the text bold.


Now go to the first row header (1) and use that pull down to select 'Convert to Header Row'.


Select the entire table by clicking the corner cell at the top left, then go to the 'Charts' icon in the toolbar and select your preferred type of bar chart. You will see the chart appear with the Years as the X axis and the column header(s) as the data series labels.


If you type data directly into a new Numbers sheet then the headers will be set up automatically, but that does not happen when you import a CSV file.



Wednesday, July 4, 2012

Rails 3 YAML parse error and RedCloth


I moved a Rails 3.0.5 application to a new server with Ruby 1.9.2, under rbenv. It had been working fine before but now I got this:

$ rails server
/Users/jones/.rbenv/versions/1.9.2-p290/lib/ruby/1.9.1/psych.rb:148:in `parse': couldn't parse YAML at line 183 column 9 (Psych::SyntaxError)
from /Users/jones/.rbenv/versions/1.9.2-p290/lib/ruby/1.9.1/psych.rb:148:in `parse_stream'
from /Users/jones/.rbenv/versions/1.9.2-p290/lib/ruby/1.9.1/psych.rb:119:in `parse'
from /Users/jones/.rbenv/versions/1.9.2-p290/lib/ruby/1.9.1/psych.rb:106:in `load'
from /Users/jones/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/RedCloth-4.2.2/lib/redcloth/formatters/latex.rb:6:in `<module:LATEX>'
from /Users/jones/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/RedCloth-4.2.2/lib/redcloth/formatters/latex.rb:3:in `<top (required)>'
[...]


There are a lot of posts about this error on the web. Some recommend specifying the 'syck' YAML engine in the boot.rb file or messing with your libyaml setup. Neither worked for me.

The solution turned out to be simple. You can see from the error message that the error is coming from the RedCloth gem. Previously I had been using version 4.2.2 (look in Gemfile.lock) even though I had not explicitly set that version.

On the new system I had version 4.2.9 installed. When I set this explicitly in the Gemfile I could start up the server just fine, after running 'bundle install'.

gem 'RedCloth', '>= 4.2.9'