When Digital Dust Is Gathered, Constellation May Be Muddled
That constellation of information known as Big Data can be a sight to behold.
Adam Frank of NPR's 13.7 blog explains Big Data as "the ability to understand (and control) a seemingly chaotic world on levels never before imagined."
Big Data is like gathering digital dust, says New Yorker tech blogger Gary Marcus. "It's a very valuable tool," he says, "but it's rarely the whole solution by itself."
"We've got more powerful processors. The cost of memory gets less. The cost of storage gets less. And also, a lot more data is being created," Chris Barnatt, who teaches computing at Nottingham University in England, tells NPR's Jacki Lyden on weekends on All Things Considered. "Everything from social media sites, to buying things online with e-commerce, to simulations going on in companies' medical facilities. Everyone is creating more and more information."
Marcus says "Big Data" versus just "data" is really a matter of magnitude. But he says that's important because quantity can actually make a difference in the story the data tells.
Here's the catch: While Big Data can uncover correlations between data points, it doesn't reveal causation. Sometimes, that doesn't really matter, but other times, it might — in ways we're not always aware of.
For instance, the city of Boston developed a smartphone app called Street Bump to track potholes. "It passively collects GPS data and accelerometer data so it can report when you drive over potholes," Kate Crawford, a researcher for Microsoft Research and visiting professor at MIT, tells Lyden.
The thing is, not everyone bouncing around on bad roads has a smartphone — or the app, for that matter — to help gather the data.
"And indeed, this maps very closely around how rich a particular neighborhood is and what the ages of the people who live in that neighborhood [are]," Crawford says.
She says the ethics get tricky with that sort of skewed data. The city is trying to address the discrepancy by working with academics. But Boston isn't the only city taking on such projects. The Wall Street Journal reports on an initiative in New Jersey to manage traffic.
To summarize the tension around Big Data, New York Times reporter Steve Lohr quotes Albert Einstein: "Not everything that counts can be counted, and not everything that can be counted counts."
Lohr, who writes about Big Data and privacy, tells Lyden he's more concerned about what the data gets wrong than how much is revealed about us.
"The real danger for most of us is discrimination by statistical inference," he says.
Remember when you searched for "deep fryer" online for your cooking class? Well, that search could now be associated with unhealthy behavior, and the nuance gets lost.
It's not just online tracking consumers should be wary of, Lohr says. Credit cards, for example, store a lot of information that you might not think of. His advice:
"Put the health club membership on the credit card, but your visits to the liquor store should be in cash because those things will follow you in ways you don't know now."
JACKI LYDEN, HOST:
You're listening to WEEKENDS on ALL THINGS CONSIDERED from NPR News. I'm Jacki Lyden.
Coming up in the next hour, we check in with The Atlantic's James Fallows and remember the prima ballerina from the plains of Oklahoma, Maria Tallchief. And also, the latest copyright concerns from writer and Authors Guild president Scott Turow.
But first, let's start today with a scene from "Batman," "The Dark Knight." In this scene, Batman stands in front of a row of blinking video monitors looking for the Joker.
(SOUNDBITE OF "THE DARK KNIGHT")
MORGAN FREEMAN: (as Lucius Fox) You've turned every cellphone in Gotham into a microphone.
CHRISTIAN BALE: (as Batman) And a high-frequency generator receiver.
FREEMAN: (as Lucius Fox) You took my sonar concept and applied it to every phone in the city. With half the city feeding you sonar, you can image all of Gotham. This is wrong.
LYDEN: Funny how a film about technology that seemed so impossible just a few years ago seems ever more present. But in a world of big data, what you can track seems infinite. That's our cover story today: big data, the aggregation of our digital data trail, who collects it, how they use it and where we're going with it.
Gary Marcus writes for the New Yorker's tech blog. He says accumulating big data is like gathering digital dust.
GARY MARCUS: Big data - one way to think about it is it's a very valuable tool, but it's rarely the whole solution by itself.
LYDEN: Marcus tells us about Watson. That was the supercomputer IBM designed to compete on "Jeopardy!" on TV a couple of years ago. Alex Trebek even introduced Watson.
(SOUNDBITE OF GAME SHOW, "JEOPARDY!")
ALEX TREBEK: Developed and programmed especially for this moment, making its first appearance on our national television program. Ladies and gentlemen, this is Watson.
MARCUS: Part of the way it did so well is they had a massive database, namely Wikipedia, built into it. And they were able to trawl that and also other aspects of the Web in order to make good guesses.
(SOUNDBITE OF GAME SHOW, "JEOPARDY!")
WATSON: Who is Jude?
TREBEK: Yes. Watson?
WATSON: Who is Michael Phelps?
TREBEK: Yes. Watson?
WATSON: What is event horizon?
LYDEN: Watson appeared to understand Trebek as he asked the questions. He competed against the two top "Jeopardy!" champions, and Watson won. But big data doesn't function like a human brain.
MARCUS: There are also other things like classic artificial intelligence techniques like temporal reasoning, trying to figure out, well, was this person alive before this other person? Big data might get confused if you just used it by itself. It might say, well, these two things are very correlated, but it doesn't say, well, which one happened before which other one?
LYDEN: We've always had access to digital information, of course, ever since it was invented. But to understand how much there is now and how fast things are changing, we turn to Chris Barnatt. He teaches computing at Nottingham University in England and runs a site called ExplainingComputers.com.
CHRIS BARNATT: We've got more powerful processors. The cost of memory gets less. The cost of storage gets less. And also, a lot more data is being created. Everything from social media sites, to people doing - buying things online with e-commerce, to simulations going on in companies' medical facilities. Everyone is creating more and more information.
LYDEN: And we're processing this data differently, using it almost narratively to make patterns and predictions.
BARNATT: Hospitals get lots and lots of information from their patients, and they don't just capture information on them in terms of who they are and their details. But they, for example, might put a camera inside someone during an operating - an operation and look at that data and probably just throw it away because they wouldn't see any value in that data after the operation.
In a big data world, you'd say, well, if you can capture all the information from all the cameras put into every patient in the world in a year, you could look at all that data with a piece of artificial intelligence and start to predict things about the actual health of the population.
LYDEN: And the faster this goes, the more clients want it. Svetlana Sicular is a director at Gartner, a data research firm. She helps their business clients figure out what they can do with all that wealth of information.
SVETLANA SICULAR: Data is becoming pervasive. It's getting new qualities. For example, we - maps are the data, or text or voice or video or geolocation of where you are - we can know everything about it.
LYDEN: Big data's not just what happens when marketing companies target their audience. You know, like those ads that pop up on a search engine. You search for one pair of shoes, you get shoe ads for weeks. It's also diagnostic. Like Chris Barnatt in England, Svetlana Sicular sees big data making advances in medicine.
SICULAR: Medical records are being digitized by analyzing massive amounts of text records over those doctor's notes, which were sitting somewhere in folders on the shelves. We can find out something which is really astonishing. We can find from doctor's notes that it's enough to look for some particular vein on the neck and you will be able to predict that there's the best indicator of a disease.
LYDEN: Then there are the big data skeptics. Big data doesn't think. It evaluates, aggregates and predicts. But reasoning, you could say there's a breakdown between cause and effect.
Kate Crawford, a researcher for Microsoft and a visiting professor at MIT, gives an example from Boston. The city wanted data to fix its potholes, so it developed a smartphone app called Street Bump.
KATE CRAWFORD: It passively collects GPS data and accelerometer data so that it can report when you drive over potholes.
LYDEN: Cool, right? Instead of having to actively seek out the potholes, the city could use people's phones to figure out when the car suddenly thumped over a rough spot. Except for one thing: The app only captured data in more affluent neighborhoods.
CRAWFORD: And, indeed, this maps very closely around how rich a particular neighborhood is and what the ages of the people who live in that neighborhood [are].
LYDEN: Skewed data that collates only from better-off people might be missing a lot more than potholes. The ethics get tricky quickly. Boston's still working to fix both the app and the potholes. But Crawford warns that this kind of situation might happen more often as we look to big data for policy solutions.
Steve Lohr has been writing about big data and privacy at The New York Times where he's a reporter. Remember our "Dark Knight" example from Gotham? Or we could just as easily have turned to Orwell because technology is allowing us to track movements around the Web and the real world. Lohr isn't sure that all that Big Brother information needs to be tracked or should be. He quotes Einstein.
STEVE LOHR: Not everything that counts can be counted, and not everything that can be counted counts. And that's the tension here.
LYDEN: Not everything that can be counted counts. But the risk, it seems to me, is that there is a massive invasion of privacy risk. And this is something you've written a lot about.
LOHR: Look, my concern, for instance, in tracking all your online behavior and aggregating it and making assumptions about you is not that these algorithms are so clever and they know everything. It's that they're so stupid. The real danger for most of us is discrimination by statistical inference.
LYDEN: Because it attributes a causality that might not be there.
LOHR: Exactly. And, you know, if you're a simple person, maybe the algorithms sort of work. But if you have a complex set of motivations, you know, you search for deep fat fryers, right? And it may be because you're in a cooking school and that was an assignment. Or you're going to buy this for a friend, right? It correlates to unhealthy behavior.
It's not just online. I think the place people really give up their privacy and don't realize it is putting everything on a credit card. That's the gold standard. It's got everything on you, and it's all being correlated and put together by these data brokers who track online behavior as well as, you know, your credit card transactions. And so, you know, what happens is your world will be shaped by what your transactions are.
And so, you know, put the health club membership on the credit card, but your visits to liquor stores should be in cash, because those things will follow you in ways you don't know now.
LYDEN: You write that The World Economic Forum issued a report last month that recommended ways in which privacy concerns could be addressed. And they drew from an example of a 1970 Fair Credit Reporting law. Tell us more about that.
LOHR: Yeah. This was the earlier round of privacy concerns related to computers. And this goes back to when the world started moving to mainframe computers. And so large corporations and government agencies had increasingly put personal information about your transactions. To some degree, they started putting IRS tax returns on computers. And that was a real wave of concern.
And the Fair Credit Reporting Act was a limitation on what you could collect and how you could use it. It could be used, you know, somewhat for hiring and for loan evaluation. But other than that, you couldn't, you know, you couldn't use it carte blanche. And the people who are in the privacy community these days kind of look back at that as a small beer by comparison to now.
LYDEN: To what's available now.
LOHR: But people were much more sensitive to it back then. You know, these days, you kind of give up a little bit more and a little bit more. And there's disadvantages to using all these things. Federal Trade Commission is, you know, is investigating and has taken some actions, particularly in children and privacy, but also looking at these sort of data brokers in between. So those are the ones that facilitate the black box.
And the real concern is that the, you know, again, the - how the data is used so that the things that you see in the world and the opportunities you're given will be determined by an algorithmic black box that you don't have any control over and may be wrong about you. And the way regulation is now looking at in the investigation of the Federal Trade Commission is doing is, you know, pushing for greater transparency. So - and this World Economic Forum report that talks about, there should be an audit trail to your data in a way that there isn't now.
LYDEN: So if I'm an individual and I am leaving a data dust trail that is just so extensive, it's difficult to know where you would begin to try to rein some of that in in a practical way. How do you protect yourself? Is there any - are there any kind of steps you'd take to limit what is known about your transactions?
LOHR: You know, I'm a participant. I don't really have a good answer, Jacki. Be careful of your online behavior. I mean, it has consequences. It's the opposite of that old New Yorker cartoon where, you know, on the Internet, nobody knows you're a dog.
LYDEN: Mm-hmm. Mm-hmm.
LOHR: They know you. They know you.
LYDEN: Steve Lohr. He writes about technology for The New York Times.
If we're all just ashes to ashes and dust to dust, big data can keep track at a molecular level. Whether that's a comfort to you or a concern, knowledge is generally power. The question is, for whom? Last spring, the U.S. government invested over $200 million in big data.
Transcript provided by NPR, Copyright NPR.