Git Yo'self Some Data

How we can (and can't) use open source tooling for open data

James Smith · @floppy

WTF is the
Open Data Institute?

non-profit, non-partisan
founded 2012 by Tim Berners-Lee and Nigel Shadbolt
"helping others be successful with open data"
economic, social and environmental value

WTF is
open data?

Open data is information that is available for anyone to use, for any purpose,
at no cost.

— http://theodi.org/guide/what-open-data

open data
must have a licence to say it is open
the licence
may impose some constraints:
attribution and/or share-alike

A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.

— http://opendefinition.org/

So What?

http://www.ordnancesurvey.co.uk/innovate/developers/minecraft-map-britain.html

http://prescribinganalytics.com/

http://smtm.labs.theodi.org/

http://dynamicinsights.telefonica.com/488/smart-steps

Data you can get
!= Open Data

Twitter Firehose
Google Maps
... and most others

Good Open Data

can be linked to
so that it can be easily shared and talked about
is available in a standard, structured format
so that it can be easily processed
has guaranteed availability and consistency over time
so that others can rely on it
is traceable, through any processing
so others can work out whether to trust it

Open Data Certificates

https://certificates.theodi.org/

Open Data enables...

cooperation
collaboration
building shared resources
public goods

...or at least the idea does

Data Collaboration

Image from MindJet

I ♥ Open Source!

Not Octocat by Cameron McEfee

GitHub Flow

from How GitHub uses GitHub to build GitHub by Zach Holman

SourceForge.net (in 2000)

GitHub ALL the things!


> %w{teachers accountants governments dogs cats hamsters DATA}.each do |x|
>   puts "GitHub for #{x}!"
> end

GitHub for teachers!
GitHub for accountants!
GitHub for governments!
GitHub for dogs!
GitHub for cats!
GitHub for hamsters!
GitHub for DATA!

er...

Git CLI

git diff --word-diff
~/.config/git/attributes
```
*.csv	diff=csv
```

~/.gitconfig

```
[color]
  ui = true
[alias]
  diffcsv = diff --word-diff
[diff "csv"]
  wordRegex = ...?
```

wordRegex=.

wordRegex=[^,\n]+[,\n]|[,]
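
To see why the second pattern suits CSV better than `.`, here is a minimal Ruby sketch of how it splits a line into cell-sized tokens. The helper names `cells` and `changed_cells` are made up for illustration, and Ruby's regex engine stands in for git's own matcher, so treat this as an approximation of what `--word-diff` sees rather than git's actual implementation. It assumes both lines have the same number of cells.

```ruby
# The pattern from the slide: a non-empty cell plus its trailing
# delimiter, or a lone comma standing in for an empty cell.
CELL = /[^,\n]+[,\n]|[,]/

# Split one CSV line into the tokens a word-diff would compare.
def cells(line)
  line.scan(CELL)
end

# Hypothetical helper: list [index, old, new] for each token that differs.
# Assumes both lines tokenise to the same number of cells.
def changed_cells(old_line, new_line)
  new_cells = cells(new_line)
  result = []
  cells(old_line).each_with_index do |old_cell, i|
    result << [i, old_cell, new_cells[i]] if old_cell != new_cells[i]
  end
  result
end

p cells("a,b,,c\n")
# => ["a,", "b,", ",", "c\n"]
p changed_cells("1,alice,london\n", "1,alice,leeds\n")
# => [[2, "london\n", "leeds\n"]]
```

With `wordRegex=.` every character is its own token, so a changed cell shows up as scattered character edits. The pattern knows nothing about quoting, so a quoted cell that contains a comma would still be split in two.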

csv-my-git

Automatically configure your local git installation for CSV

```
curl -L http://theodi.github.io/csv-my-git/install.sh | bash
git diffcsv test.csv
```

https://github.com/theodi/csv-my-git

GitLab

Open Source GitHub-alike

http://gitlab.org/

File & diff views

http://paulfitz.github.io/coopyhx/

GitHub

Winning!

Standards

(de facto or otherwise)

http://dataprotocols.org

Data Kitten

https://github.com/theodi/data_kitten

Data Ecosystem

Dependency Tracking
Validation & Testing
Quality Metrics
Visualisation
Conversion & Decoration

Crowdsourcing!

https://github.com/benbalter/github-forms

https://github.com/datasets

https://github.com/Chicago

http://sfmoci.github.io/openlaw/

ZOMG GIT FIXES EVERYTHING

Limitations

Adding a large file (50m lines)

13m 40s

Changing a single line

8m 30s

figures by Max Ogden

Dat


```
# make a new dat store
dat init
# put a JSON object into dat
echo '{"hello": "world"}' | dat
# stream the most recent of all rows
dat cat
# pipe dat into itself (increments revisions)
dat cat | dat
# start a dat server
dat serve
# delete the dat folder (removes all data + history)
rm -rf .dat
```

https://github.com/maxogden/dat

R&Wbase

http://rawbase.github.io/

Where Next?

Server-side diff calculation
Merging
Conflict resolution
CSV dialect support
More tools!
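
As a rough idea of what cell-level merging and conflict resolution could look like, here is a small Ruby sketch of a three-way merge on a single row. The function name `merge_row` and the sample rows are invented for illustration. It is not how git, dat or any ODI tool does it, and it assumes all three versions of the row have the same columns in the same order.

```ruby
# Three-way merge of one CSV row, cell by cell.
# base   - the row as it was at the common ancestor
# ours   - the row after our edit
# theirs - the row after their edit
# Returns [merged_cells, conflict_indexes].
def merge_row(base, ours, theirs)
  merged = []
  conflicts = []
  base.each_index do |i|
    if ours[i] == theirs[i]
      merged << ours[i]          # both sides agree (or neither changed)
    elsif ours[i] == base[i]
      merged << theirs[i]        # only they changed this cell
    elsif theirs[i] == base[i]
      merged << ours[i]          # only we changed this cell
    else
      merged << base[i]          # both changed it differently: keep base, flag it
      conflicts << i
    end
  end
  [merged, conflicts]
end

merged, conflicts = merge_row(%w[1 alice london],
                              %w[1 alice leeds],
                              %w[1 alicia london])
p merged     # => ["1", "alicia", "leeds"]
p conflicts  # => []
```

Two people editing different cells of the same row merge cleanly here, where a line-based merge would report a conflict. A real tool would also have to match rows by key and cope with added or reordered columns.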

Contribute!

ODI blog post:
- http://theodi.org/blog/adapting-git-simple-data
GitLab fork:
- http://github.com/theodi/gitlabhq
Git CLI configurator:
- http://github.com/theodi/csv-my-git

Open Data Institute Tech Team
@ukoditech
info@theodi.org
irc.freenode.net #theodi

http://theodi.github.io/presentations
