Git Yo'self Some Data
How we can (and can't) use open source tooling for open data
James Smith
·
@floppy
WTF is the Open Data Institute?
non-profit, non-partisan
founded 2012 by Tim Berners-Lee and Nigel Shadbolt
"helping others be successful with open data"
economic, social and environmental value
open data must have a licence to say it is open
the licence may impose some constraints: attribution and/or share-alike
A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.
— http://opendefinition.org/
So What?
why should you care? 1) because open data frees you up to build cool things without having to pay for the data or collect it yourself; 2) because your clients/customers will probably start to care.
Data you can get != Open Data
Twitter Firehose
Google Maps
... and most others
Good Open Data
can be linked to so that it can be easily shared and talked about
is available in a standard, structured format so that it can be easily processed
has guaranteed availability and consistency over time so that others can rely on it
is traceable through any processing, so that others can work out whether to trust it
Open Data enables...
cooperation
collaboration
building shared resources
public goods
...or at least the idea does
Data Collaboration
Consider the world of Open Data, where we have a load of data, but very little collaboration.
Most data is dropped into central datastores, and that's it.
If I use your dataset and find an error, the only way to get it fixed is to tell you about it,
and hope you can be bothered to sort it out.
Image from MindJet
collaboration; reuse; serendipity!
I ♥ Open Source!
Git and GitHub: how GitHub has revolutionised the process of contributing to OSS projects.
If you have some code, I can fork it, make my own changes, then hit a single button to merge those
changes back upstream. This makes contribution incredibly simple, so that the admin overhead of doing
this becomes almost zero. This allows projects to draw on a wider pool of contributors than otherwise
would have been available.
Not Octocat by Cameron McEfee
GitHub use a collaboration process they call 'GitHub Flow' (not to be confused with 'Git flow', which is
more complex).
Wouldn't it be great if we could do this with open data?
GitHub Flow
from How GitHub uses GitHub to build GitHub by Zach Holman
Unfortunately, compared to the open source world, this is a pre-SourceForge level of
collaboration. We can do better.
SourceForge.net (in 2000)
This often gets referred to as 'git for data', though in my view git is unimportant. It's all about flow.
GitHub's revolution was not that they used git; it was that they built powerful, simple workflow tools on
top of it.
This is not a new idea; people have been talking about it for years. The only catch is that there are a few
problems when it comes to handling data in git and other systems designed for source code.
I think we can get a long way with existing tools, however. If we can bend git to our will,
and use it to work with simple data in useful ways, then we can get this revolution started.
GitHub ALL the things!
> %w{teachers accountants governments dogs cats hamsters DATA}.each do |x|
> puts "GitHub for #{x}!"
> end
GitHub for teachers!
GitHub for accountants!
GitHub for governments!
GitHub for dogs!
GitHub for cats!
GitHub for hamsters!
GitHub for DATA!
So that we know what we're up against, let's look at some problems with git when working with things like CSVs.
First, it's line-oriented, built for source code. This is OK when adding a row, or changing a few cells,
but add a column and suddenly you have a change on every line.
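A minimal Ruby sketch of that effect, using made-up test data and a naive position-by-position line comparison. Both are assumptions for illustration: git's real diff algorithm is smarter about aligning lines, but it still treats each line as one unit.

```ruby
# Hypothetical test data: a small table, then the same table with one
# extra row, and with one extra column.
original = <<~CSV
  id,name
  1,alice
  2,bob
  3,carol
CSV

with_row = original + "4,dave\n"

with_column = <<~CSV
  id,name,city
  1,alice,London
  2,bob,Leeds
  3,carol,York
CSV

# Count the lines that differ when the two texts are compared
# position by position, treating each whole line as one unit.
def changed_lines(old_text, new_text)
  old_lines = old_text.lines
  new_lines = new_text.lines
  longest = [old_lines.size, new_lines.size].max
  (0...longest).count { |i| old_lines[i] != new_lines[i] }
end

puts changed_lines(original, with_row)     # 1: only the new row
puts changed_lines(original, with_column)  # 4: every line in the file
```

Adding one row touches one line; adding one column touches all of them, which is why the diff stops being readable.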
Let's look at a bit of test data. This is a small CSV file, and I've made some changes. The first thing you can
see is that I've obviously added a column. However, it's hard to see whether anything else has changed, because the diff is
utterly useless.
er...
There are some things we can improve here though. The first thing to realise is that we don't really care what git does
internally: how it stores our changes, and so on. As long as we can see what's going on, git can do what it wants inside.
That means that this is a tooling problem, so we can tinker with the tooling around the edges to try to fix our problems.
Let's start with the git tool that *everyone* has: the command line.
Git CLI
wordRegex=.
wordRegex=[^,\n]+[,\n]|[,]
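A rough way to see what these two patterns count as a "word" is to run them over a sample row. The sketch below uses Ruby's regex engine as a stand-in for git's, so edge cases may differ, and the sample rows and the `tokens` helper are made up for illustration.

```ruby
# The two patterns from the slide, written as Ruby regexes.
CHARACTER_PATTERN = /./
CELL_PATTERN = /[^,\n]+[,\n]|[,]/

# Split a line of CSV into the "words" a pattern would produce.
def tokens(line, pattern)
  line.scan(pattern)
end

row = "1,alice,London\n"

# Every character becomes its own word (`.` does not match the newline).
p tokens(row, CHARACTER_PATTERN).length   # => 14

# Each cell, together with its trailing delimiter, becomes one word.
p tokens(row, CELL_PATTERN)               # => ["1,", "alice,", "London\n"]

# An empty cell still yields a word: the bare comma.
p tokens("2,,Leeds\n", CELL_PATTERN)      # => ["2,", ",", "Leeds\n"]
```

With the second pattern a word-level diff compares whole cells, so a change is reported per cell rather than per character or per line.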
csv-my-git
Automatically configure your local git installation for CSV
curl -L http://theodi.github.io/csv-my-git/install.sh | bash
git diffcsv test.csv
https://github.com/theodi/csv-my-git
This is all very well, but what makes GitHub Flow really usable is... GitHub. How can we get CSV diffs
into GitHub? Unfortunately, their core display code isn't open source, but we have the next best thing: GitLab.
GitLab
Open Source GitHub-alike
http://gitlab.org/
So, all we need to do is change the views for files and diffs to add CSV support. Files are pretty easy, but diffs are harder,
mainly because just working out a CSV diff is non-trivial. Coopyhx to the rescue.
http://paulfitz.github.io/coopyhx/
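To give a feel for what a table-aware diff has to report, here is a deliberately naive Ruby sketch. It is not how coopyhx works internally. It assumes unquoted fields, a header row, and a unique key in the first column, and `parse_table` and `table_diff` are hypothetical helper names.

```ruby
# Parse simple CSV text (no quoted fields) into a header list and a hash
# of rows keyed by the value in the first column.
def parse_table(text)
  header, *rows = text.lines.map { |line| line.chomp.split(",", -1) }
  keyed = rows.to_h { |cells| [cells.first, header.zip(cells).to_h] }
  [header, keyed]
end

# Report added/removed columns and rows, plus changed cells in the
# rows and columns that both versions share.
def table_diff(old_text, new_text)
  old_header, old_rows = parse_table(old_text)
  new_header, new_rows = parse_table(new_text)

  changed = []
  (old_rows.keys & new_rows.keys).each do |key|
    (old_header & new_header).each do |column|
      old_value = old_rows[key][column]
      new_value = new_rows[key][column]
      changed << [key, column, old_value, new_value] if old_value != new_value
    end
  end

  {
    added_columns: new_header - old_header,
    removed_columns: old_header - new_header,
    added_rows: new_rows.keys - old_rows.keys,
    removed_rows: old_rows.keys - new_rows.keys,
    changed_cells: changed
  }
end

# Made-up data: a column and a row are added, and one name is edited.
old_csv = "id,name\n1,alice\n2,bob\n"
new_csv = "id,name,city\n1,alice,London\n2,robert,Leeds\n3,carol,York\n"

result = table_diff(old_csv, new_csv)
p result[:added_columns]  # => ["city"]
p result[:added_rows]     # => ["3"]
p result[:changed_cells]  # => [["2", "name", "bob", "robert"]]
```

Even this toy version separates "a column was added" from "a cell was edited". A real tool also has to cope with reordered rows and columns, tables without an obvious key, and quoting, which is where the hard work lies.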
With the coopyhx JavaScript library doing all the hard work, adding diff rendering is actually really easy.
We published this on our blog a few weeks ago, and a few days later, GitHub announced CSV support in their web interface!
GitHub
Standards
(de facto or otherwise)
Data Ecosystem
Dependency Tracking
Validation & Testing
Quality Metrics
Visualisation
Conversion & Decoration
ZOMG GIT FIXES EVERYTHING
Limitations
Adding a large file (50m lines)
13m 40s
Changing a single line
8m 30s
figures by Max Ogden
Dat
# make a new dat store
dat init
# put a JSON object into dat
echo '{"hello": "world"}' | dat
# stream the most recent of all rows
dat cat
# pipe dat into itself (increments revisions)
dat cat | dat
# start a dat server
dat serve
# delete the dat folder (removes all data + history)
rm -rf .dat
https://github.com/maxogden/dat
Where Next?
Server-side diff calculation
Merging
Conflict resolution
CSV dialect support
More tools!
Contribute!
ODI blog post:
GitLab fork:
Git CLI configurator: