Wednesday 16 January 2019

Big Data's "theory-free" analysis is statistical malpractice

a post by Cory Doctorow for the Boing Boing blog



One of the premises of Big Data is that it can be "theory-free": rather than starting with a hypothesis ("men at buffets eat more when women are present," "more people will click this button if I move it here," etc.) and then gathering data to validate your guess, you just gather a ton of data and look for patterns in it.

The thing is, patterns emerge in every large dataset without necessarily representing any wider statistical truth. Think of the celebrated rise and fall of Google Flu Trends: researchers sifted millions of search queries for the 45 terms that best tracked where the flu had spread and treated them as predictors of flu, but the predictive power turned out to be an illusion. Search enough terms and some of them will always coincide with flu outbreaks; without a causal theory that you can test, all you know for sure is that you've found an instance of correlation, with no way to tell whether that correlation is a coincidence or a newly discovered iron law.
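To see why, here is a minimal simulation sketch (mine, not Doctorow's or Google's actual method; every name and number is made up for illustration): generate tens of thousands of random "search term" series, correlate each with an equally random "flu" series, and the best-fitting handful will look like genuine predictors even though everything is noise.

```python
# Hypothetical illustration of the "theory-free" trap: thousands of purely
# random "search term" series correlated against a random "flu" series.
import numpy as np

rng = np.random.default_rng(0)

n_terms = 50_000  # candidate search terms (for scale: Google screened millions)
n_weeks = 150     # weeks of observation

flu = rng.normal(size=n_weeks)               # noise standing in for flu incidence
terms = rng.normal(size=(n_terms, n_weeks))  # noise standing in for search volumes

# Pearson correlation of every term series with the "flu" series.
flu_c = flu - flu.mean()
terms_c = terms - terms.mean(axis=1, keepdims=True)
corr = (terms_c @ flu_c) / (np.linalg.norm(terms_c, axis=1) * np.linalg.norm(flu_c))

top45 = np.sort(np.abs(corr))[-45:]
print(f"strongest of {n_terms:,} random terms: r = {top45[-1]:.2f}")
print(f"weakest of the 'best 45':             r = {top45[0]:.2f}")
# Every series here is noise, yet each of the "best 45" would look
# statistically significant tested on its own: search enough variables
# and some will always appear to predict the target.
```

Picking the best 45 after the fact and then judging them against the same data is the multiple-comparisons trap; a causal theory, tested against fresh data, is what guards against it.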


I have only one thought to add: "Eight out of ten cats …".

