Data Cats

James Smith

Data Kitten

Data is complicated stuff, but that’s nothing in comparison to actually trying to work out what it is in any sort of automated way.

When you find some, how do you know what it’s for, how it’s laid out, and how it’s licensed? As if a load of different metadata formats weren’t enough, there are also a bunch of different naming schemes, adding an extra layer of confusion.

For a developer wanting to create something that grabs datasets straight off the web, these differences in format are a pain, particularly when you’re making something like, oh, an Open Data Certificate, or a data previewer for git repositories.

So, to help our apps understand all these different formats, we’ve created a Ruby gem called Data Kitten (it’s a play on DCAT, the vocabulary that we’ve decided to use for naming conventions internally).

Data Kitten does something pretty handy. If you give it the URL of a dataset, it will run off, knock it around a bit like a ball of wool, and try to work out what it can about the dataset. You can then ask it what the license is, or who maintains it, without worrying about exactly how the dataset is stored or what format the metadata is in. It will give you change history if available, or perhaps a list of the files contained in the dataset.

It currently grabs metadata from DataPackages and from the CKAN API, which a number of data portals use, but it could, in theory, support any dataset representation there is.
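To make the idea a bit more concrete, here is a minimal sketch of that kind of normalisation. It is not Data Kitten's actual code or API: the `Metadata` struct, the two reader modules and `read_metadata` are all names made up for illustration. It assumes a DataPackage descriptor is a JSON object with a top-level `resources` array, and that a CKAN `package_show` response wraps the dataset in a `result` object.

```ruby
require "json"

# Hypothetical common shape that every reader maps onto.
# None of these names come from the Data Kitten gem itself.
Metadata = Struct.new(:title, :licenses, :maintainers, :files, keyword_init: true)

# Reads a datapackage.json style descriptor.
module DataPackageReader
  def self.handles?(doc)
    doc["resources"].is_a?(Array)
  end

  def self.read(doc)
    Metadata.new(
      title:       doc["title"] || doc["name"],
      licenses:    Array(doc["licenses"]).map { |l| l["name"] || l["id"] },
      maintainers: Array(doc["maintainers"]).map { |m| m["name"] },
      files:       doc["resources"].map { |r| r["path"] || r["url"] }
    )
  end
end

# Reads the JSON envelope returned by CKAN's package_show action.
module CkanReader
  def self.handles?(doc)
    doc["result"].is_a?(Hash)
  end

  def self.read(doc)
    pkg = doc["result"]
    Metadata.new(
      title:       pkg["title"],
      licenses:    [pkg["license_id"]].compact,
      maintainers: [pkg["maintainer"]].compact,
      files:       Array(pkg["resources"]).map { |r| r["url"] }
    )
  end
end

READERS = [DataPackageReader, CkanReader].freeze

# Parse a JSON string and hand it to the first reader that recognises it.
def read_metadata(json)
  doc = JSON.parse(json)
  reader = READERS.find { |r| r.handles?(doc) }
  raise ArgumentError, "unsupported metadata format" unless reader

  reader.read(doc)
end
```

Whichever format the JSON arrived in, the caller can then ask for `read_metadata(json).licenses` or `.maintainers` and get the same kind of answer. Supporting another format would mean writing one more reader and adding it to `READERS`.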

The end result is that we can auto-populate Open Data Certificates given just a URL, which is pretty handy. It also means that our increasingly-inaccurately-named git data viewer can load non-git datasets (albeit a bit slowly).

Data Kitten is still only a little ball of fluff, but we’ll be nurturing it further, and if you’d like to help it grow up into a raging data-munching tiger, consuming everything in its path, then you can join in and contribute to the code on GitHub.