In the mid-1990s, a Massachusetts government agency, the Group Insurance Commission, released hospital visit records for every state employee, free of charge, to the research community. At the time of this early open data initiative, state governor William Weld went on record to say that the confidentiality of everyone involved was protected. After all, all explicit personally identifiable information, such as names, addresses and social security numbers, had been removed, and that was deemed enough to keep those individuals anonymous.
But a young graduate student at MIT by the name of Latanya Sweeney had a hunch that all was not well. She went looking through the data and, with remarkable ease, managed to identify a key individual: Governor Weld himself. To illustrate her point, she pulled his personal medical records, prescription details and all, and had them delivered to his office.
So how did Sweeney find the governor in the records? It was remarkably simple: she combined the released medical data with another easily available data source, the electoral roll. The fact that the governor lived in Cambridge, Massachusetts was a matter of public record, so Sweeney paid $20 for the voter list covering all 54,000 Cambridge residents. Matching it against the medical records made finding the governor a breeze: only six people in Cambridge shared his birth date, only three of them were male, and only one lived in his ZIP code.
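This kind of linkage attack needs nothing more sophisticated than a join on the attributes the two data sets happen to share. Here is a minimal Python sketch of the idea; the records, names and values below are invented for illustration, not drawn from the actual data sets:

```python
# "Anonymised" medical records: explicit identifiers removed, but the
# quasi-identifiers (ZIP code, birth date, sex) left intact.
medical_records = [
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "..."},
    {"zip": "02139", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "..."},
]

# Public voter roll: names attached to the very same quasi-identifiers.
voter_roll = [
    {"name": "W. Weld", "zip": "02138", "birth_date": "1945-07-31", "sex": "M"},
    {"name": "J. Smith", "zip": "02139", "birth_date": "1970-01-01", "sex": "F"},
]

def quasi_key(record):
    """The combination of attributes both data sets share."""
    return (record["zip"], record["birth_date"], record["sex"])

# Index the voter roll by quasi-identifier, then look up each medical
# record. A unique match re-identifies that record.
voters_by_key = {}
for voter in voter_roll:
    voters_by_key.setdefault(quasi_key(voter), []).append(voter)

for record in medical_records:
    matches = voters_by_key.get(quasi_key(record), [])
    if len(matches) == 1:
        print(f"Re-identified {matches[0]['name']}: {record['diagnosis']}")
```

The attack works because this particular combination of attributes is surprisingly identifying: Sweeney went on to estimate that around 87% of the US population is uniquely identified by ZIP code, birth date and gender alone.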
Almost 20 years later, it seems we have failed to learn the lesson of how easy it can be to re-identify individuals in supposedly anonymised data sets simply by matching them against other available data. In today's connected social media world there is more information in the public domain than ever before, and it can be worryingly easy to de-anonymise released data.
Is big data privacy even possible?
A 2015 study published in Science analysed three months of credit card data belonging to 1.1 million people. Financial institutions routinely sell this type of data to other organisations: it's a valuable revenue stream, as they get to monetise data they were already collecting anyway.
Of course the credit card companies, just like the Massachusetts agency, had removed explicit identifiers such as names and account numbers from the data.
But just like Sweeney, the researchers were able to identify individuals in the data with ease. They simply combined the credit card records with information freely posted by individuals to social media. Remarkably, just four pieces of information, the dates and places of four purchases, were enough to match a person to their credit card details 90% of the time.
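To see why four points are so identifying, consider a toy version of the matching step in Python. The user IDs and transactions below are invented, and a real transaction log would of course be vastly larger:

```python
# Pseudonymised credit card data: each user's transactions reduced to
# (date, place) points. All values here are invented for illustration.
transactions = {
    "user_7391": {("2015-01-05", "cafe"), ("2015-01-06", "bakery"),
                  ("2015-01-09", "cinema"), ("2015-01-12", "bookshop")},
    "user_2204": {("2015-01-05", "cafe"), ("2015-01-07", "gym"),
                  ("2015-01-10", "bar"), ("2015-01-12", "bookshop")},
}

# Four points gleaned from one person's public social media posts
# ("coffee at the cafe on Monday", "movie night on Friday", ...).
observed = {("2015-01-05", "cafe"), ("2015-01-06", "bakery"),
            ("2015-01-09", "cinema"), ("2015-01-12", "bookshop")}

# A user is a candidate if their transactions contain all four points.
candidates = [user for user, points in transactions.items()
              if observed <= points]

# In the study, four such points singled out one user 90% of the time.
if len(candidates) == 1:
    print(f"Unique match: {candidates[0]}")
```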
As a result of the Sweeney case, the US government introduced new privacy protections for the release of health data. The Health Insurance Portability and Accountability Act (HIPAA), for example, stipulated that ZIP codes must be generalised to just their first three digits, hugely increasing the number of individuals covered by any one ZIP code in the data. And where those three digits still describe an area of 20,000 or fewer people, the ZIP code cannot be published at all and must be replaced with 000.
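As a sketch of how that generalisation rule might look in code (the prefix populations below are invented; a real implementation would take them from census data):

```python
# Hypothetical population counts per three-digit ZIP prefix. A real
# implementation would source these figures from census data.
PREFIX_POPULATION = {"021": 750_000, "036": 15_000}

def generalise_zip(zip_code: str) -> str:
    """Apply the HIPAA-style rule: keep only the first three digits,
    and suppress even those when the prefix area is too small."""
    prefix = zip_code[:3]
    if PREFIX_POPULATION.get(prefix, 0) <= 20_000:
        return "000"  # small areas are suppressed entirely
    return prefix     # otherwise publish just the prefix

print(generalise_zip("02138"))  # -> "021"
print(generalise_zip("03609"))  # -> "000"
```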
But waiting for researchers to breach supposedly anonymous data sets, and then patching the specific flaws that allowed the breach, is not a long-term solution to the confidentiality problem. Like a game of whack-a-mole, even as you lock down one attack vector, another pops up in its place. And who knows who might have taken advantage of the flaw in the meantime...
And the more adjustments you make to the data—such as removing an entire range of ZIP codes—the less useful it becomes.
It’s clear that we need a better way to protect individual privacy in big data. We need to keep the data as useful as possible, while providing robust, scalable and secure confidentiality.
A Better Way
Stay tuned for part two, where we look at how WingArc Australia's sophisticated confidentiality algorithms can help...
Image: “System Lock” by Yuri Samoilov used under Creative Commons