This post is a part of a series about mining everyday data, based on a talk Rich Ziade and I are giving at the IA Summit in Memphis later this month.
I recently read Stephen Baker’s Numerati which details how the math elite are increasing their influence over our lives. It was a decent read, albeit one that might have been better as a magazine article. That said, one quote from the book has stuck with me and greatly influenced my upcoming talk at the IA Summit:
Jack Einhorn, the chief scientist at a New York media start-up called Inform Technologies, predicts that the great discoveries of the twenty-first century will come from finding patterns in vast archives of data. "The next Jonas Salk will be a mathematician," he says. "Not a doctor."
This is a stunning statement – that an upcoming significant medical breakthrough might be found by a mathematician and not a doctor – and speaks to the unrealized power that lies in sifting through and making sense of the vast sea of data that exists today. A number of related trends are increasing the possibility that the next great scientific breakthrough might come from someone spotting a previously unknown relationship in data:
- more of our lives are moving towards being explicitly carried out or indirectly tracked digitally
- computers continue to get smaller and more pervasive
- storage continues to get cheaper
- semantic relationships are being established in data
- the ubiquity of APIs and integration
- the call for opening up islands of data
- legislation forcing transparency
- capitalistic motivations
So who might the next Jonas Salk be? Surely you would have to consider one of the thousands of researchers currently analyzing existing data or setting up scientific studies to test theories emerging from correlations in data. Maybe a university researcher will tie together a specific gene, a specific diet, and a specific environmental trigger and isolate that combination as the cause of a disease – one impossible to spot until now.
What I find more interesting, though, is that the next Jonas Salk might not be a professional or academic at all; he or she might be an interested amateur.
Pork and Beans
In the late 1970s, a night watchman at the Van Camp’s pork and beans factory in Kansas named Bill James spent his spare time sifting through the vast available record of baseball statistics and would in turn instigate a revolution in the sport. Before James, baseball players were largely judged by scouts employing their own, oftentimes subjective or emotional, criteria. James pulled the clothes off of this emperor using objective reason and statistics. Baseball was perfectly ripe for this revolution – in addition to the preexisting inefficiency in the marketplace for players, there was a 100+ year record of objective performance data to be exploited.
For those who don’t know the rest of this story, James inspired an army of statisticians who have since dominated the game (most famously described in the [all-time great] book Moneyball by Michael Lewis), and he was recently hired by the Boston Red Sox. The outsider-amateur is now the ultimate inside professional.
Similarly, you’re probably familiar with FiveThirtyEight.com, where amateur statistician Nate Silver has revolutionized political polling with statistics. His call of 50 out of 51 contests in the 2008 presidential election and all of the Senate races in the same year was stunning and beat just about every "expert". What you might not know is that Silver is a well-known baseball writer stamped out of the Bill James mold, and in fact developed the PECOTA player-projection system for Baseball Prospectus, a son-of-Bill-James baseball publication.
There’s something romantic and uniquely American about this narrative – the passionate outsider-amateur seeking objective truths in data. So if the next Jonas Salk turns out to be a mathematician, the history books might just trace their origins straight back to that pork and beans factory in Lawrence, Kansas.
Chris Dary Said:
Great thoughts, Tim.
I think we as a species haven’t even really grasped the potential power that we have waiting for us in all the data we’re creating every day. It’s going to be a long road, but when we get there I think our everyday lives will be very different.
Being a software guy, I’m cursed with thinking computationally; as a result my head goes right to the web when questions like this come up. I think web services play a huge role in the amateurization of data mining right now. In the future, if we as a collective can nail something like the semantic web, we’ll be in a much better place.
But for now, specific data mining is too hard. We’re certainly leaps and bounds better off than our progenitors, but I think we’d see even more world-changing effects if the barrier to getting at the data we want were lower.
Thankfully, for now we do have great resources like Programmable Web’s API Directory, and a lot of companies opening up their data. But I still feel like we’re seeing the data we have and tailoring our questions to it, not seeing what questions we have and retrieving the data to answer them.
Soon enough, hopefully. It’ll be an even more exciting time when we get there.
Richard Ziade Said:
What’s most exciting about this, in my mind, is that it’s a reboot of sorts. Whether they like it or not, researchers who’ve labored over multiple sclerosis for 30 years have inevitably hit the same walls for years.
It’s a whole new fertile ground. They say innovation is key these days. I agree. Now all we have to do is tend to the soil.
Chris LoSacco Said:
@Chris – have you seen the buzz around Wolfram Alpha? Speaks to your point about questions v. data.
Chris Dary Said:
@ChrisL – yeah, I’ve seen it, but I think it’s a much more specific answer than I’m going to be looking for.
I’d want large amounts of data through which I can sift and find trends, not a single sentence answer.