Imagine that someone you once went on a single date with could find out where you live, work and socialise, and when. That's the reality uncovered by University of Melbourne researchers, who have once again highlighted the risks of releasing supposedly “de-identified” open data in unit record form.
In 2018, Public Transport Victoria released 1.8 billion travel records showing “touch on” and “touch off” events for 15 million myki cards (Melbourne's public transport smart card ticket). The data covered a three-year period up to June 2018 and was intended for use in a data hacking event, the Melbourne Datathon. In addition to the trip records, the dataset also included the myki card type, identifying which cards were concession/discounted cards issued to groups such as children, seniors, refugees, police and politicians.
While PTV believed the dataset did not contain personally identifying information (no names were released and the myki card numbers had been replaced with generated ID values), the researchers found it was trivially easy to identify real people in the data. As in many similar incidents, combining a surprisingly small amount of known external information with the released records was enough to uncover a wealth of personal information about someone: in this case, three years of detailed, timestamped travel activity revealed all sorts of information about that person's life.
The researchers, Dr. Chris Culnane, A/Prof. Benjamin I. P. Rubinstein and A/Prof. Vanessa Teague, easily found themselves in the dataset by cross-referencing it against their own travel records from their myki accounts. Once they had located themselves, they could extract the entire three-year travel history of any other passenger who had travelled with them, even if the co-travel was just a single trip.
But you didn’t even have to have travelled with someone to find their records in the dataset.
The extra information indicating concession card types meant that certain individuals were even easier to locate in the data release. The team demonstrated this by looking for Victorian MP Anthony Carbines in the dataset. With his permission, they used information posted publicly to Mr Carbines' Twitter account to identify his detailed travel activities over the three-year period.
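To give a sense of how little outside information such a linkage attack needs, here is a minimal sketch of the general approach. Everything in it is hypothetical: the file, column names and events are invented for illustration and are not the structure of the actual myki release or the researchers' code. The idea is simply to keep only the cards consistent with each externally known touch-on event.

```python
import pandas as pd

# Hypothetical illustration only: file, column names and values are invented,
# not the actual structure of the myki release.
trips = pd.read_csv("trips.csv")  # columns: card_id, stop_id, touch_on_time

# A handful of events known from an external source
# (e.g. a tweet saying "on the 8:12 from Rosanna"):
known_events = [
    ("rosanna", "2017-03-06 08:12"),
    ("parliament", "2017-08-22 09:05"),
    ("rosanna", "2018-02-14 08:15"),
]

# Start with every card as a candidate, then keep only the cards that
# match each known (stop, time) observation.
candidates = set(trips["card_id"])
for stop, time in known_events:
    window = pd.Timedelta(minutes=2)  # allow a little timestamp slack
    t = pd.to_datetime(time)
    match = trips[
        (trips["stop_id"] == stop)
        & (pd.to_datetime(trips["touch_on_time"]).between(t - window, t + window))
    ]
    candidates &= set(match["card_id"])
    print(f"after matching {stop} @ {time}: {len(candidates)} candidate card(s)")

# Whatever card remains now yields the person's full travel history.
full_history = trips[trips["card_id"].isin(candidates)]
```

Each additional known event shrinks the candidate set dramatically; once a single card remains, its entire three-year history, and that of anyone who ever travelled with it, falls out of the same table.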
For its part, despite an investigation by the Office of the Victorian Information Commissioner (OVIC) finding that PTV breached privacy legislation by issuing the data in this form, PTV maintains that the dataset does not contain personally identifiable information:
It is not possible to re-identify individuals based on the data PTV provided alone. A lot more information and further steps are required from other sources, along with private knowledge, data science expertise and capability, for the scenarios mentioned in the Report to arise.
Department of Transport
This position spectacularly misses the point.
Data publishers cannot absolve themselves of a privacy risk by simply ignoring easily obtained external information. Similar research has demonstrated on many occasions that it typically requires only a handful of additional data points to uniquely identify an individual in detailed unit record data. This external information is often easily available to someone determined to locate a specific individual.
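This is not just a theoretical worry; the claim is easy to test on any detailed unit record dataset. A rough sketch, again with an invented file and column names, is to take a few random (stop, date) observations per card and count how many cards those few points already pin down uniquely:

```python
import pandas as pd

# Illustrative only: estimate how many cards are uniquely identified
# by k randomly chosen (stop, date) observations. Column names invented.
trips = pd.read_csv("trips.csv")  # columns: card_id, stop_id, touch_on_time
trips["date"] = pd.to_datetime(trips["touch_on_time"]).dt.date

def unique_on_k_points(card_id, k=4):
    """True if k random (stop, date) points from this card match no other card."""
    points = trips.loc[trips["card_id"] == card_id, ["stop_id", "date"]].drop_duplicates()
    points = points.sample(min(k, len(points)), random_state=0)
    matching_cards = trips.merge(points, on=["stop_id", "date"])["card_id"].nunique()
    return matching_cards == 1

sample = trips["card_id"].drop_duplicates().sample(1000, random_state=0)
share_unique = sum(unique_on_k_points(c) for c in sample) / len(sample)
print(f"Share of sampled cards unique on 4 observations: {share_unique:.0%}")
```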
What Does This Mean for Open Data?
So what are the implications for open data? Firstly, we hope that the myki privacy breach does not lead to a reduction in the volume and utility of open data releases. Open data, done right, offers many potential benefits to people and society by enabling better data-driven decisions about the world we live in.
As this research demonstrates, it may not be possible to release unit record data by itself in a way that protects individual privacy.
One approach to mitigating the risk is to release only aggregated data, but this can make the data much less useful, because it will only contain the specific combinations of variables that the data publisher chose to generate.
Here at WingArc, we believe there’s a better way. And this is precisely why we developed SuperWEB2, our data dissemination solution, and perturbation, our confidentiality algorithm.
Our platform allows users to run any query they like against a dataset, but presents only the aggregated results. The query runs against the unit records, yet the unit records themselves are never exposed to the end user. It's a “best of both worlds”, self-service approach: users can ask any question of the data, while tight controls are still applied to everything that is released.
It’s a world away from uploading unit record data to the cloud.
Further, our perturbation algorithm automatically applies subtle but consistent adjustments to the values returned, mitigating the risk of identifying an individual without introducing bias or distorting the general trends and patterns reflected in the results.
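To make the idea concrete, here is a minimal sketch of consistent perturbation in general, not WingArc's actual algorithm: the query runs over the unit records, only aggregated counts come back, and each cell receives a small, repeatable adjustment so that re-running the same query can never be averaged to recover the true value. The file and column names are hypothetical.

```python
import hashlib
import pandas as pd

# Minimal illustration of the general idea, not WingArc's actual algorithm:
# queries run against the unit records, but only aggregated, perturbed
# cell counts are ever returned to the user.
trips = pd.read_csv("trips.csv")  # hypothetical unit record file

def perturb(cell_label, count, max_adjust=2):
    """Apply a small, repeatable adjustment derived from the cell itself.

    Hashing the cell label means the same cell always receives the same
    adjustment, so repeated queries cannot be averaged to recover the
    true count; small, centred adjustments leave overall patterns intact.
    """
    digest = hashlib.sha256(cell_label.encode()).digest()
    adjustment = digest[0] % (2 * max_adjust + 1) - max_adjust  # in [-2, 2]
    return max(count + adjustment, 0)

def aggregate_query(df, by):
    """Return perturbed counts grouped by the requested variables."""
    counts = df.groupby(by).size()
    return {key: perturb(f"{by}|{key}", n) for key, n in counts.items()}

# Users can ask any question of the data...
print(aggregate_query(trips, ["card_type", "stop_id"]))
# ...but only perturbed aggregates leave the system; unit records never do.
```

Production implementations typically derive the adjustment from keys attached to the contributing unit records rather than from the cell label, but the effect is the same: a given cell always returns the same slightly adjusted value, and the small, centred adjustments leave overall trends intact.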
Want to Know More?
Our solution is in use at statistical agencies all over the world, protecting individual privacy in datasets like the Australian Bureau of Statistics census data and the UK’s pensions and social services data.