A collection of computer systems and programming tips that you may find useful.
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Thursday, March 21, 2013

SODA - Socrata Open Data API

Socrata is a company that helps organizations make their data available to outside users. They work with people like the World Bank and a number of city governments, including the City of Seattle.

You can access these public data sources via SODA - the Socrata Open Data API which provides a SQL-like query language (SoQL) that you submit via HTTP request to an API endpoint for that resource. The default response is JSON but you can get CSV, XML and RDF.

They provide quite a bit of documentation but it is a bit short on examples and I had to poke around a bit to get the right syntax for the data sources that I am interested in.

A very useful resource is the 'live' query console they provide for a test dataset, which lets you try various queries.

I access their resources from Ruby. I generate the URL, use open-uri to fetch the data and then parse the JSON response. Here is an example fo the process:

The data source I want to query list the Seattle Fire Department 911 calls. That page lets you create custom views and download the data. If you click on 'Export' you can see the column names used in the database as well as the url for the API endpoint.

I want to get all records in the past hour, so I need a query along the lines of 'where datetime > my_timestamp'. Specifically I need to build a URL similar to this:

http://data.seattle.gov/resource/kzjm-xkqj.json?$where datetime > my_time_stamp

This is not a real query just yet but note a couple of things...

The API endpoint is http://data.seattle.gov/resource/kzjm-xkqj.json

This version ends in '.json' and so it will return the response in JSON. If you want CSV then replace '.json' with '.csv' - same goes for XML and RDF.

Also note the dollar sign before the 'where' - it is easy to overlook - that signifies the following word is a SoQL clause and not the name of a column.

A big problem I ran into was figuring out the format for timestamps. The web page showing the table formats them as '03/21/2013 07:35:00 AM -0700' whereas the JSON response uses "datetime" : 1363872960. But neither of these works in a query... and the documentation on data types does not describe what you should use. Trial and error with that console page led me to the correct format of '2013-03-21 07:00:00' which is similar to the ISO 8601 format.

Also note that the time needs to be quoted... again, apparently not in the documentation

So my query should now look something like this:

http://data.seattle.gov/resource/kzjm-xkqj.json?$where datetime > '2013-03-21 07:00:00'

But that URL needs to be escaped before submission. That was the other issue - URI.escape() handles the spaces, quotes, etc but for some reason it does not escape the dollar sign. So I had to do that for myself.

Here is a chunk of Ruby code that gets all the 911 calls in the past hour:

require 'open-uri'
require 'time'
require 'json'

endpoint = 'http://data.seattle.gov/resource/kzjm-xkqj.json'
timestamp = (Time.now - (60 * 60)).strftime("%F %H:%M:%S")

query = "$where=datetime > '#{timestamp}'"

url = "#{endpoint}?#{query}"
url = URI.escape(url)
url.gsub!(/\$/, '%24')

puts "query: #{query}
puts "url: #{url}

json = open(url).read
puts json

result = JSON.parse(json)
puts result.length

That will show you the query, the escaped URL, the JSON returned and the number of records in the parsed JSON - enough to get you started. Different data sources will use different column names and perhaps different formats for date/time (always be aware of time zone issues).

Having a common API to many government/public data sources is great but the Socrata-backed sources that I've looked a so far don't seem to be quite as consistent as one would hope. This is an issue for the different data providers - it's not a Socrata issue.

I would hope that one could query the endpoint in some way and get a listing of all the column names and data types. There is one for SODA v1.0 but this now deprecated... so perhaps there is a newer version but it has not been documented.

Part of the answer lies in two custom HTTP response headers (X-SODA2-Fields and X-SODA2-Types) that are returned with your data. Here they are for the data source used here:

X-SODA2-Fields: [":updated_at","address","longitude",":id","latitude","incident_number","datetime","type",":created_at","report_location"]
X-SODA2-Types: ["meta_data","text","number","meta_data","number","text","date","text","meta_data","location"]
X-SODA2-Legacy-Types: true

You can see the headers if you use 'curl -i "your_url"' on the command line with your escaped url.


Unknown said...

These are great tips!
Palm Beach Paper Delivery Services

Unknown said...

Great post! I came across this entry as I'm trying to access the corresponding police 911 SODA API. I tried following the code, but inserted the police API Access Endpoint instead but received OpenURI::HTTPError: 400 Bad Request. Any ideas why?


Archive of Tips