What Do Statisticians Think About Big Data?

Big Data Week

Big Data Week just happened, in London and around the globe. As others before me have pointed out, it has become de rigueur to ascribe all sorts of supernatural powers to big data.

There is a sentiment that if you have enough data, you are bound to get the right answer. A statistician will tell you that more is not necessarily better. A data scientist, a software engineer or a business person may tell you the same, but probably less emphatically.

I have a feeling that statisticians tend towards cynicism, perhaps because we realise how much randomness is out there. Patterns that seem miraculous, like three children born on the same day in one family, are almost expected in a population the size of the UK.

Probabilities are not intuitive, and numerous examples show why. For instance: how many people do you need to invite to your party for a 50/50 chance that two of them share a birthday? [1] But this is a topic for another day; back to the wonders of big data.

Of course, the statistical community supports Big Data Week because we recognise the enormous potential of large data sets. There are statisticians at Google, and there are many parallels between statisticians and big data analysts (who might call themselves something else).

2013 is the Year of Statistics. But what is statistics? While I don’t want to answer this question here, I find that Bradley Efron is a great source for bits of wisdom: “Statistics is the science of information gathering, especially when the information arrives in little pieces instead of big ones.”

My vision is that the data arrives in little, open pieces. Then we can stick them together and create big data. In fact, Rufus Pollock from the Open Knowledge Foundation goes one step further: “forget big data, small data is the real revolution”. He points out that size doesn’t matter and that we should focus on a distributed ecosystem of information.

I can summarise my statistical view on big data quite succinctly:

Bigger Data ≠ Better Data

For big data to be better data it needs to

  1. solve relevant problems,
  2. keep noise, bias and spurious results to a minimum,
  3. and, crucially, have a team of researchers who are able to understand it fully.
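The second point is easy to underestimate. As a minimal sketch, assuming purely random Gaussian data and hypothetical function names chosen for illustration, the following Python snippet compares one random "target" series against a growing number of equally random candidate series and reports the strongest correlation found by chance alone:

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_variables, n_observations, seed=0):
    """Largest |correlation| between a random target and pure-noise variables.

    Everything here is simulated noise (an assumption of this sketch),
    so any correlation that turns up is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_observations)]
    best = 0.0
    for _ in range(n_variables):
        noise = [rng.gauss(0, 1) for _ in range(n_observations)]
        best = max(best, abs(pearson(target, noise)))
    return best


# More candidate variables, same 20 observations each.
for n_variables in (1, 10, 100, 1000):
    strongest = strongest_chance_correlation(n_variables, 20)
    print(n_variables, round(strongest, 2))
```

Because every series is noise, the strongest correlation can only grow as more variables are searched. This is one way in which a wider data set can produce more convincing-looking but meaningless patterns.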

Big data advocates envision a world where we are free of hypotheses, or even the scientific method, and simply “let the data speak”. Indeed, we should leave plenty of room for methods that are not model-based. However, data is not neutral. We have to interpret it.

Nate Silver, the current poster boy of statisticians, writes in his book The Signal and the Noise about big data:

“The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning. […] Before we demand more of our data, we have to demand more of ourselves.”

He shows, for example, how more data has only hardened the sides in a political divide. Similarly, more data in global warming research has done little to resolve the debate. On the other hand, Silver is actually known for his forecast models and smart predictions.

So, are we excited about big data?

It depends.

(A statistician’s answer to most questions.)

 - - - - - -

[1] That’s right: 23. And to be 99% sure? 57. See more under the birthday problem. You think conditional probabilities are better? Let me know at ulrich@theodi.org.
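The figures in [1] can be checked with a few lines of Python. This is a minimal sketch of the standard birthday-problem calculation, assuming 365 equally likely birthdays, no leap years and independent guests; the function names are made up for illustration:

```python
def shared_birthday_probability(n, days=365):
    """Probability that at least two of n people share a birthday."""
    if n > days:
        return 1.0  # pigeonhole principle: a shared birthday is certain
    p_all_distinct = 1.0
    for i in range(n):
        # Person i + 1 must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def smallest_group(target, days=365):
    """Smallest group size whose shared-birthday probability reaches target."""
    n = 1
    while shared_birthday_probability(n, days) < target:
        n += 1
    return n


print(smallest_group(0.50))  # 23
print(smallest_group(0.99))  # 57
```

The trick is to compute the probability that all birthdays are distinct and subtract it from one, which is much simpler than counting the coincidences directly.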