Will the Real
Data Scientist
Please Stand Up!

Dr Tom Heath · Head of Research · Open Data Institute

tom.heath@theodi.org · @tommyh

Digital Research 2013, Oxford, 10 September 2013


You can access the slides at http://theodi.github.io/presentations/.


Background

  • PhD in Social Network-driven Recommender Systems
  • Heavily involved in Linked Data community
  • Senior Research/Data Scientist at Talis
  • Data Scientist at Open Data Institute
  • Head of Research at Open Data Institute

The Open Data Institute

  • non-profit, non-partisan
  • "helping others be successful
    with open data"
  • startups, training
  • tech services
    (e.g. http://certificates.theodi.org/)
  • research, policy

Data Scien(ce|tist):
An Honest Critique

Data Science:
Why all the fuss?

Why all the Fuss?

  • "the new oil"
    "the new raw material"
  • differences
    data is a non-rival good
    marginal cost of distribution
    falling cost of analysis
  • a more level playing field
    more like gold than oil?
  • everyone wants a piece of the action

The Rise of the
Data Scientist

Pros of the Concept "Data Science"

  • shines the spotlight on data
  • highlights the potential for value
  • a unifying label for what we already do
  • a focal point for our aspirations

Cons of the Concept "Data Science"

  • a catch-all label for what we already do
  • a focal point for our confusion!

What is Data Science?

  • often defined in terms of attributes of
    the data scientist

"Data science isn't just about the existence of data, or making guesses about what that data might mean; it's about testing hypotheses and making sure that the conclusions you're drawing from the data are valid."

— http://radar.oreilly.com/2010/06/what-is-data-science.html

  • just sounds like science, right?

The Naming Problem

  • biological science, computer science,
    political science, sports science...
  • the discipline is the focus, not the tool
  • where does this leave data science?

Two Interpretations of
Data Science

  • the science of data?
  • science with data?

Interpretation 1:
The Science of Data

(Open) Data
Research Topics

  • Organising and Publishing
  • what is a meaningful definition of a collection/data set?
  • how should data sets be described on the Web?
  • how can we best extract (and aggregate)
    these descriptions?
  • how do licensing choices affect
    the data ecosystem?

(Open) Data
Research Topics

  • Discovery, Comprehension and Use
  • how do we prioritise data discovery
    (i.e. crawler optimisation)?
  • can we meaningfully summarise large data sets?
  • what are the optimal indexing schemes for data?
  • how linked can data be?
  • are there novel cognitive architectures that can underpin our interactions with data?

Interpretation 2:
Science with Data

  • duh!

Give the People what they Want!

  • actionable, data-driven insights and answers...
  • ...to significant problems
  • could be business/organisational/societal
  • amounts to evidence-based everything
  • cf. the "growth hacker"

How?

  • identify the problem/question
  • design research/analysis protocol
  • identify the required data
  • operational change to ensure the data is collected!
  • perform the analysis
  • report back to 'client'
  • close the loop through organisational change

An Example

[Screenshot: Prescribing Analytics]

  • potential saving of £200 million per year

— http://prescribinganalytics.com/

Caveats

  • there has to be a research question
    (otherwise it's 'just' engineering)
  • beware the obsession with, e.g., machine learning

Training Priorities

  • statistics, experimental design, communication, business nous

Lessons for the
Research Community

  • a different style of large scale data crunching
  • nurturing communities

Key Challenges for
Data Scientists

Access to Data

  • this is not a lab environment
  • work with the data you can scrounge, and/or petition effectively within the business to start collecting what you need

Data Quality

  • again, not a lab environment
  • e.g. how testable/tested is your cleansing code?
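One way to read "testable" here: cleansing logic lives in small pure functions, with the awkward cases pinned down as assertions. The sketch below is illustrative only; the function name `clean_amount` and the currency-string scenario are assumptions for the example, not taken from the talk.

```python
import re
from typing import Optional

# Accepts an optional minus sign, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def clean_amount(raw: Optional[str]) -> Optional[float]:
    """Parse a messy currency string such as ' £1,234.50 ' into a float.

    Hypothetical helper: returns None for missing or unparseable values
    instead of guessing, so bad rows stay visible downstream.
    """
    if raw is None:
        return None
    # Trim whitespace, drop a leading currency symbol and thousands separators.
    text = raw.strip().lstrip("£$€").replace(",", "").strip()
    if not _NUMBER.fullmatch(text):
        return None
    return float(text)


def test_clean_amount() -> None:
    """Each assertion documents one decision about dirty input."""
    assert clean_amount(" £1,234.50 ") == 1234.5
    assert clean_amount("£ 40") == 40.0
    assert clean_amount("-12") == -12.0
    assert clean_amount("n/a") is None
    assert clean_amount("nan") is None
    assert clean_amount("") is None
    assert clean_amount(None) is None


if __name__ == "__main__":
    test_clean_amount()
```

Keeping the function free of file or database access means the test runs without the real data set, which is usually what makes cleansing code practical to test at all.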

Domain Specialism

  • meaningful questions come from domain problems, not the existence of data

Questions?

http://theodi.github.io/presentations/2013-09-oxford-real-data-scientist.html
