Recent media coverage of work by University of Melbourne researchers highlights once again the risks of publishing open data without fully considering the privacy implications.
The team found that the de-identified medical records of nearly 3 million Australians, which had been published on the Australian Government Open Data portal, were in fact not quite so de-identified. By combining the released records with other publicly available data, the researchers were able to identify a number of individuals.
This type of breach is known as a "linkage attack": if you combine information in the data release with other information that you know, then you may be able to identify individuals in the data.
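To make the idea concrete, here is a minimal sketch of a linkage attack on a toy dataset. All records, names, and field names are invented for illustration and bear no relation to the actual release. The attacker joins the "de-identified" records to facts they already know, using a handful of shared attributes (often called quasi-identifiers).

```python
# Hypothetical "de-identified" release: names removed, other attributes kept.
released = [
    {"birth_year": 1975, "postcode": "3052", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1975, "postcode": "3052", "sex": "M", "diagnosis": "diabetes"},
    {"birth_year": 1982, "postcode": "3000", "sex": "F", "diagnosis": "fracture"},
    {"birth_year": 1982, "postcode": "3000", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical facts the attacker already knows from other public sources.
known = [
    {"name": "Alice Example", "birth_year": 1975, "postcode": "3052", "sex": "F"},
    {"name": "Carol Example", "birth_year": 1982, "postcode": "3000", "sex": "F"},
]

# Attributes present in both datasets (an assumption of this sketch).
QUASI_IDENTIFIERS = ("birth_year", "postcode", "sex")


def key(record):
    """Build the join key from the shared attributes."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(released, known):
    """Return {name: released record} for people who match exactly one record."""
    by_key = {}
    for record in released:
        by_key.setdefault(key(record), []).append(record)

    matches = {}
    for person in known:
        candidates = by_key.get(key(person), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches[person["name"]] = candidates[0]
    return matches


reidentified = link(released, known)
for name, record in reidentified.items():
    print(name, "->", record["diagnosis"])
```

In this toy data, one person is matched to a single record and so is re-identified, while the other matches two records and remains ambiguous. Real attacks follow the same pattern with more attributes and more records.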
As with many similar previous examples, it took surprisingly few additional facts to breach this dataset. In the researchers' own words:
Our findings replicate those of similar studies of other de-identified datasets:
- A few mundane facts taken together often suffice to isolate an individual.
- Some patients can be identified by name from publicly available information.
- Decreasing the precision of the data, or perturbing it statistically, makes re-identification gradually harder at a substantial cost to utility.
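The trade-off in the last point can be illustrated with a small sketch. The records and the coarsening rule below are invented for illustration. The sketch counts how many records are unique on their attribute combination before and after precision is reduced.

```python
from collections import Counter

# Hypothetical (birth_year, postcode) pairs from a toy release.
records = [
    (1975, "3052"), (1976, "3053"), (1978, "3056"),
    (1982, "3000"), (1983, "3004"), (1991, "3121"),
]


def unique_count(rows):
    """Number of rows whose combination of values appears exactly once."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1)


def coarsen(row):
    """Reduce precision: birth year to decade, postcode to its first two digits."""
    year, postcode = row
    return (year // 10 * 10, postcode[:2] + "xx")


coarse = [coarsen(row) for row in records]

print(unique_count(records))  # every precise record is unique
print(unique_count(coarse))   # fewer unique records after coarsening
```

Coarsening leaves fewer records that stand out, but the data can no longer answer questions about individual birth years or postcodes, which is the utility cost the researchers describe.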
It is clear that merely removing personally identifiable information and then releasing unit records is not enough to ensure confidentiality of the individuals in your data.
However, we believe that there is a better way to release useful data while maintaining individual privacy. And this is precisely why we worked with the Australian Bureau of Statistics to develop Perturbation, our sophisticated confidentiality solution for open data.
The University of Melbourne team conclude that:
"There is no good solution for publishing sensitive unit-record level data that protects privacy without substantially degrading the usefulness of the data."
While that is very likely true if you are releasing raw unit records, our solution provides what we believe is the best compromise between flexibility and confidentiality.
It combines our perturbation protection with a sophisticated web-based data dissemination tool (SuperWEB2) and our open standard API.
Consumers of the data -- from researchers to the general public -- can use these tools to ask any question they like. The aggregated results are tabulated on the fly from the underlying unit record data and protected with perturbation before being returned to the end user.
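The general pattern of tabulating on demand and perturbing the result can be sketched as follows. This is not the algorithm used by the product or by the Australian Bureau of Statistics; it is a generic illustration in which the records, the `SECRET` value, the `perturbed_count` function, and the noise range are all assumptions made for the example.

```python
import hashlib
import random

# Hypothetical unit records; they stay on the server and are never returned.
RECORDS = [
    {"id": 1, "sex": "F", "age_group": "30-39"},
    {"id": 2, "sex": "F", "age_group": "40-49"},
    {"id": 3, "sex": "M", "age_group": "30-39"},
    {"id": 4, "sex": "F", "age_group": "30-39"},
    {"id": 5, "sex": "M", "age_group": "40-49"},
]

# Hypothetical server-side secret so users cannot reproduce the noise themselves.
SECRET = "replace-with-a-real-secret"


def perturbed_count(predicate, max_noise=2):
    """Count matching records, then add small noise derived from the matching set.

    Deriving the noise from the contributing record ids means the same cell
    always receives the same perturbation, so repeating a query does not
    allow the noise to be averaged away.
    """
    ids = sorted(r["id"] for r in RECORDS if predicate(r))
    if not ids:
        return 0
    material = SECRET + ":" + ",".join(str(i) for i in ids)
    digest = hashlib.sha256(material.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    noise = rng.randint(-max_noise, max_noise)
    return max(0, len(ids) + noise)


# Only the perturbed aggregate leaves the server.
print(perturbed_count(lambda r: r["sex"] == "F"))
```

The user sees only an aggregate that is close to the true count. The individual rows behind it are never exposed, and asking the same question twice returns the same answer.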
Most importantly, the unit records themselves are never released. That's because when users have the power to ask any question, they don't need to be.
Want to Know More?
Our solution has been protecting highly sensitive data released by organisations around the world for many years.