16 February 2012

In torrential rains, you need an umbrella. For torrents of data, you need statistics

Technically, "because I didn't have observational data." Working with experimental data requires you only to be able to calculate two means and look up a t-statistic in a table.

The excellent Silbey at the Edge of the American West is stunned by the torrents of data that future historians will be able to deal with. He predicts that the petabytes of data being captured by government organizations such as the Air Force will be a major boon for historians of the future --

and I surely can't be the only person who always says "Of the future!" in the same way that the announcers of the 1930s Flash Gordon serials would announce the impending arrival of aliens --

but that this torrent of data means that it will take vastly longer for historians to sort through the historical record.

He is wrong. It means precisely the opposite. It means that history is on the verge of becoming a quantified academic discipline.

The sensations Silbey is feeling have already been captured by an earlier historian, Henry Adams, who wrote of his visit to the Great Exposition of Paris:
He [Adams] cared little about his experiments and less about his statesmen, who seemed to him quite as ignorant as himself and, as a rule, no more honest; but he insisted on a relation of sequence. And if he could not reach it by one method, he would try as many methods as science knew. Satisfied that the sequence of men led to nothing and that the sequence of their society could lead no further, while the mere sequence of time was artificial, and the sequence of thought was chaos, he turned at last to the sequence of force; and thus it happened that, after ten years’ pursuit, he found himself lying in the Gallery of Machines at the Great Exposition of 1900, his historical neck broken by the sudden irruption of forces totally new.

Because it is strictly impossible for the human brain to cope with large amounts of data, in the age of big data we will have to turn to the tools we've devised to solve exactly that problem. And those tools are statistics.

It will not be human brains that directly run through each of the petabytes of data the US Air Force collects. It will be statistical software routines. And the historical record that the modal historian of the future confronts will be one that is mediated by statistical distributions. Inshallah, yes, this means that historians will debate whether a given event was caused by a process that follows a negative binomial or a Poisson distribution.

And scholarship will be better for it.
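
For readers wondering what a negative binomial versus Poisson debate might look like in practice, here is a rough sketch. A Poisson process has a variance equal to its mean, while a negative binomial allows the variance to exceed the mean, so one crude first check is the variance-to-mean ratio of the counts. The function name `dispersion_index` and both series of yearly counts are made up for illustration. This is a heuristic and not a formal test.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean.

    Values near 1 are consistent with a Poisson process; values well
    above 1 indicate overdispersion, which a negative binomial can model.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the index is undefined")
    return variance(counts) / m


# Hypothetical yearly event counts, invented purely for illustration.
even = [3, 4, 2, 5, 3, 4, 3, 4]
bursty = [0, 0, 1, 14, 0, 2, 0, 11]

for name, counts in [("even", even), ("bursty", bursty)]:
    print(name, round(dispersion_index(counts), 2))
```

Both series have the same mean, but the bursty one has a far larger index, which is the kind of evidence that would push an analyst toward the negative binomial. A real analysis would fit both models and compare them formally.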


I do have the notes for my long-promised sequel to "What do quallys know, anyway?", tentatively entitled "Quantoids don't know anything," and this post will probably prompt me to finally finish it.


  1. I'd agree that software will help, but that assumes that the video is properly prepared for and accessible to those kind of searches. That'll be true of a lot of it, but I suspect that vast quantities will become like a lot of the paper record: useful to historians but not important enough to be processed for digital accessibility.

    Thank you for the "excellent." You're very kind.

    1. "that assumes that the video is properly prepared for and accessible to those kind of searches"

      Very possibly. Although Google's IP-enforcing automatic video search makes me think that this is largely a problem resolvable by computing power. (Legal restrictions might be something else entirely.)