The myki Privacy Breach. What Now for Open Data?

Imagine that someone you once went on a single date with could find out where you live, work and socialise, and when. That's the reality uncovered by University of Melbourne researchers, who have once again highlighted the risks of releasing supposedly “de-identified” open data in unit record form.

In 2018, Public Transport Victoria released 1.8 billion travel records showing “touch on” and “touch off” events for 15 million myki cards (Melbourne's public transport smart card ticket). The data covered a three year period up to June 2018, and was intended for use in a data hacking event, the Melbourne Datathon. In addition to the trip records, the dataset also included the myki card type, identifying which cards were concession/discounted cards issued to groups such as children, seniors, refugees, police and politicians.

While PTV believed the dataset did not contain personally identifying information—there were no names released and the myki card numbers had been replaced with a different generated ID value—researchers found that it was trivially easy to identify real people in the dataset. As with many similar incidents, in the myki privacy breach, combining a surprisingly small amount of known external information with the released data was enough to uncover a wealth of personal information about someone. In this case three years of detailed timestamped travel activities revealed all sorts of information about that person's life.

The researchers easily found themselves in the dataset by cross referencing against their own travel records from their myki accounts. Once they had located themselves, they could use this to extract the entire three year travel history of other passengers who had travelled with them, even if the co-travel was just a single trip:

This type of re-identification is particularly concerning, since it allows an individual to leverage the ease of re-identifying themselves to re-identify others, and from potentially only a single co-travel event. This presents a risk for anyone who has co-travelled with someone in the past, for example, an ex-partner, a co-worker, or even just someone they went on a single date with. Due to the large amount of data provided, i.e., all touch on and off events, it could allow a malicious party to determine where someone lived, worked, or socialised—and when they visit these places and for how long.

Dr. Chris Culnane, A/Prof. Benjamin I. P. Rubinstein, A/Prof. Vanessa Teague

The University of Melbourne, Australia

But you didn’t even have to have travelled with someone to find their records in the dataset.

The extra information that indicated concessional card types meant that certain individuals were even easier to locate in the data release. The team demonstrated this by looking for Victorian MP Anthony Carbines in the dataset. With his permission, they used information posted publicly to Mr Carbines' Twitter account to identify his detailed travel activities over the three year period.

For its part, despite an investigation by the Office of the Victorian Information Commissioner (OVIC) finding that as a result of the myki privacy breach, PTV had breached privacy legislation when it issued the data in this form, PTV maintains that the dataset does not contain personally identifiable information:

So what are the implications for Open Data? Firstly, we hope that this myki privacy breach does not lead to a reduction in the volume and utility of open data releases. Open data, done right, offers many potential benefits to people and society, by enabling better data-driven decisions to be made about the world we live in.

As this research demonstrates, it may not be possible to release unit record data by itself in a way that protects individual privacy.

One approach to mitigate the risk is to release only aggregated data, but this runs the risk of making the data less useful, as it will only contain specific combinations of variables that have been generated by the data publisher.

Here at WingArc, we believe there’s a better way. And this is precisely why we developed SuperWEB2, our data dissemination solution, and perturbation, our confidentiality algorithm.

Our platform allows users to run any query they like against a dataset, but presents only the aggregated results. While the query runs against the unit records, the unit records themselves are never exposed to the end user. It's a “best of both worlds”, self-service approach that allows users to ask any question of the data, while still applying tight controls to the released data.

It’s a world away from uploading unit record data to the cloud.

Further, our perturbation algorithm automatically applies subtle but consistent adjustments to the values returned that mitigate the risk of identifying an individual without introducing bias or affecting the general trends or patterns reflected in the results.

Business Intelligence, Data and IoT Platform

Solutions by Technology

Explore Our Products

Data Analytics

Business Process Solutions

Industry Reports

WingArc Reports

Webinars

Past and Upcoming Webinars

Data Stories

Need help?

Support Options

Case Studies

About Us

The myki Privacy Breach

What It Means For Open Data, and How To Release Data Without Compromising Individuals

Dr. Chris Culnane, A/Prof. Benjamin I. P. Rubinstein, A/Prof. Vanessa Teague

Department of Transport

What Does this Mean for Open Data?

Want to Know More?

Try SuperWEB2 Now

Learn More

Contact Us

Matt Armstrong

Related Posts

Data Breach Highlights Importance of Confidentiality

Confidentiality for the Big Data World

The Search for Privacy in a Big Data World