Why “Don’t worry it’s de-identified” should (still) be a red flag when considering privacy risk

‘Fingers crossed’ is not smart business strategy. The Privacy Commissioner’s report into I-MED shows that to pursue innovation and compliance, getting de-identification right is complex – but worth it.

Last month’s Office of the Australian Information Commissioner (OAIC) report into I-MED’s disclosure of patient records to an AI company should not be seen as permission to engage in an AI ‘free-for-all’.

In finding that I-MED was not in breach of the Privacy Act when it supplied de-identified patient imaging records to a third party in order to train an AI diagnostic model, Privacy Commissioner Carly Kind positioned her report squarely within the current political debate about harnessing AI to improve productivity while also protecting Australians from harm. However, in doing so, Commissioner Kind also served up a reminder that “developing an AI model is a high privacy risk activity when it relies on large quantities of personal information,” and that the Privacy Act “applies to the collection, use and disclosure of personal information to train AI models, just as it applies to all uses of AI that involve personal information.”

Kind continued: “This case study shows how good governance and planning for privacy at the start of a new initiative can support an organisation to adopt new and innovative data-driven technologies in a way that protects the rights of individuals.”

This decision also serves as a case study to illustrate that while some might regard de-identification as a silver bullet solution to their privacy compliance, in reality the true test for success in de-identification is as much about the additional controls placed on the environment and the recipients of the data, as it is about the actual de-identification techniques applied.

This means that legal, risk and compliance professionals need to be involved, as well as the data scientists. “Don’t worry, it’s de-identified” should never be accepted as a complete legal strategy.

A case closed, but not settled

By way of context: in September 2024, Cam Wilson, technology journalist at Crikey, published a story titled “Australia’s biggest medical imaging lab is training AI on its scan data. Patients have no idea.”

Prompted by Wilson’s story, the OAIC began a preliminary inquiry into the disclosure of medical imaging scans by I-MED to Annalise.ai, a former joint venture between I-MED and Harrison.ai, a healthcare artificial intelligence company. The inquiry was commenced to determine whether a formal investigation should be opened.

The reason the OAIC closed its preliminary inquiry in April 2025, without commencing a formal investigation, was that “the Commissioner was satisfied that the patient data shared with Annalise.ai was de-identified sufficiently that it was no longer personal information for the purposes of the Privacy Act.”

The OAIC concluded that: “Although the steps taken by I-MED could not entirely remove the risk of re-identification, the Commissioner was satisfied that it reduced that risk to a sufficiently low level and was supported by sound data governance practices.”

In other words: I-MED was found to have sufficiently de-identified what had been ‘personal information’, to the point where it was no longer personal information, and thus the Australian Privacy Principles (APPs) did not apply. Hence: no need for patient notice, let alone consent.

However, the OAIC’s report about their preliminary inquiries into I-MED raises yet more questions.

We do acknowledge that there may be proprietary reasons for not going into further detail about the de-identification techniques used by I-MED. However, as a result, regulated entities cannot use this report as a straightforward blueprint for success.

Rather than seeing this report as a permissive green flag, the next time you hear “don’t worry, it’s de-identified,” you should still be raising a red flag to pause and review your risk before proceeding.

Next steps for organisations

The following are key issues we suggest addressing in a comprehensive Privacy Impact Assessment (PIA) for any proposal involving the use of de-identified data to train an AI model.

Is the act of de-identification itself lawful?

The OAIC report did not address whether the initial act of using the health information in order to de-identify it (i.e. the use prior to the disclosure) was itself compliant with APP 6; the OAIC’s concern in this inquiry was limited to the potential disclosure, after the de-identification techniques had been applied.

Other OAIC guidance on de-identification states that the use of data in order to de-identify it may “generally” be considered a ‘use’ allowed under the ‘directly related secondary purpose’ exception under APP 6.2(a).

Is de-identifying for the purpose of disclosure to a third party, in order to train an AI model, one of those circumstances? Does it matter who the third party is, what the resulting AI tool will be used for, who benefits, or what expectations were set with the original data subjects about this use case?

How does the nature of the dataset impact our risk exposure?

K-anonymity, which is one of the methodologies used to measure the ‘success’ of de-identification, is built on the premise that for every individual in a dataset, there is only one record. An example would be a census-type snapshot of a population at one point in time.
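To make the idea concrete, here is a minimal sketch of how ‘k’ might be measured, assuming one record per person. The field names and records are made up for illustration; nothing here reflects the actual I-MED dataset or the method used to assess it.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for the chosen quasi-identifiers (the dataset's 'k')."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())

# Hypothetical records: one row per person, as k-anonymity assumes.
records = [
    {"age_band": "30-39", "sex": "F", "postcode": "2000"},
    {"age_band": "30-39", "sex": "F", "postcode": "2000"},
    {"age_band": "30-39", "sex": "F", "postcode": "2000"},
    {"age_band": "70-79", "sex": "M", "postcode": "2000"},
    {"age_band": "70-79", "sex": "M", "postcode": "2000"},
]

k = k_anonymity(records, ["age_band", "sex", "postcode"])
print(k)  # 2: every person shares their combination with at least one other
```

A higher k means each person ‘hides’ in a larger crowd. A k of 1 means at least one person is unique on those fields.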

But what if the dataset is longitudinal? If the dataset covers a long span of time, patients may be represented more than once in the dataset. For example, a patient may have more than one medical procedure over the time span covered. The fact that hashed patient ID numbers were present in the I-MED dataset that was supplied to Annalise.ai suggests that the dataset was longitudinal, and that the end users wanted to be able to match records from different dates pertaining to the same patient. (Hashing is a way to enable matching and linkage of records without exposing the underlying raw data to view.)

Also, the report states that “30 million patient studies” and associated diagnostic reports were provided to Annalise.ai between 2020 and 2022, suggesting it was more than a point-in-time snapshot.

Longitudinal datasets pose greater re-identification risk than cross-sectional (snapshot in time) datasets. How is this risk to be managed?
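One way to see the problem is to compare a count of records with a count of distinct patients. The sketch below uses made-up rows and hypothetical field names; it assumes only that a (hashed) patient ID links the rows belonging to one person.

```python
from collections import defaultdict

def record_level_k(rows, quasi_identifiers):
    """Smallest group size when every row is counted separately."""
    counts = defaultdict(int)
    for row in rows:
        counts[tuple(row[f] for f in quasi_identifiers)] += 1
    return min(counts.values())

def patient_level_k(rows, quasi_identifiers, patient_key):
    """Smallest group size when each patient is counted only once."""
    patients = defaultdict(set)
    for row in rows:
        patients[tuple(row[f] for f in quasi_identifiers)].add(row[patient_key])
    return min(len(ids) for ids in patients.values())

# Hypothetical longitudinal data: patient "a1" appears three times.
rows = [
    {"patient": "a1", "age_band": "60-69", "sex": "F"},
    {"patient": "a1", "age_band": "60-69", "sex": "F"},
    {"patient": "a1", "age_band": "60-69", "sex": "F"},
    {"patient": "b2", "age_band": "40-49", "sex": "M"},
    {"patient": "c3", "age_band": "40-49", "sex": "M"},
]

fields = ["age_band", "sex"]
print(record_level_k(rows, fields))              # 2: looks acceptable
print(patient_level_k(rows, fields, "patient"))  # 1: patient a1 is alone
```

Counting rows suggests nobody is unique, but one of the ‘groups’ is a single patient seen three times. A test designed for snapshot data can therefore overstate how well a longitudinal dataset has been protected.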

How should we test for ‘success’ in de-identification?

Should we be testing simply for leakage of direct identifiers, or something more nuanced?

The report did not squarely address the question of uniqueness in the dataset, in addition to whether or not identity itself could be exposed. For example, a combination of demographic and clinical data might serve to make some patients unique, and thus able to be ‘singled out’. (Since 2017, the OAIC has been at pains to point out – and in 2023 the Government accepted – that the legal test under the Privacy Act is that an individual is ‘identifiable’ if they can be distinguished from all others in a group, even if their identity is not known.)

Indeed, demographic data alone might be sufficient to ‘single out’ outlier patients, for example if you combine gender, age, geographic location (such as postcode or local government area), and ethnicity or country of birth.
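As a rough illustration of how quickly demographic fields combine to single people out, the sketch below counts how many people are unique as each field is added. The six people and four fields are invented for the example.

```python
from collections import Counter

FIELDS = ["gender", "age", "postcode", "country_of_birth"]

# Hypothetical people, one tuple per person, in FIELDS order.
people = [
    ("F", 34, "2000", "Australia"),
    ("F", 34, "2000", "Australia"),
    ("F", 34, "2010", "Australia"),
    ("M", 34, "2000", "Australia"),
    ("M", 34, "2000", "Vietnam"),
    ("M", 91, "2000", "Australia"),
]

def unique_count(people, n_fields):
    """Number of people whose first n_fields values match nobody else."""
    counts = Counter(person[:n_fields] for person in people)
    return sum(1 for person in people if counts[person[:n_fields]] == 1)

uniques = [unique_count(people, n) for n in range(1, len(FIELDS) + 1)]
for n, count in enumerate(uniques, start=1):
    print(" + ".join(FIELDS[:n]), "->", count, "unique")
```

In this toy data nobody is unique on gender alone, but four of the six people are unique once all four fields are combined, without a name or address in sight.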

What is the trade-off in terms of data utility?

The OAIC report mentions “top and bottom coding”, and “aggregating certain fields into large cohorts to avoid identification of outliers.”

However, if you decide to hide statistical outliers to control for privacy risk, you may find that you lose utility in the data. This can undermine the very purpose of establishing your training data set.

For example, if you were to remove Indigenous status as a data field, you may not be able to determine whether the resulting diagnostic model appropriately recognises medical conditions for Aboriginal or Torres Strait Islander patients. Similarly, removing statistical outliers in relation to the clinical data may impact on the quality of the model developed, such that it is unable to diagnose rare conditions, or common conditions when they appear in patients at an unusual age.
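The trade-off can be seen in a simple sketch of top and bottom coding applied to age. The thresholds of 18 and 85 are our own illustrative assumptions; the report does not say which fields or cut-offs I-MED used.

```python
def top_bottom_code(age, lower=18, upper=85):
    """Collapse ages below `lower` and at or above `upper` into open-ended
    categories, leaving ages in between unchanged."""
    if age < lower:
        return f"under {lower}"
    if age >= upper:
        return f"{upper}+"
    return str(age)

ages = [3, 45, 62, 97]  # made-up patient ages
coded = [top_bottom_code(age) for age in ages]
print(coded)  # ['under 18', '45', '62', '85+']

# The privacy gain is also the utility loss: a toddler and a teenager
# are now indistinguishable to any model trained on the coded field.
print(top_bottom_code(3) == top_bottom_code(17))  # True
```

The 97-year-old no longer stands out, which is the point. But a model trained on this field can no longer learn anything that depends on the difference between a 3-year-old and a 17-year-old.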

Even if you have robustly controlled for both direct and indirect identifiers and clinical outliers via de-identification techniques, other attribute data may itself be rich enough that some individuals will be unique in the dataset, and thus ‘identifiable’ in law.

For example, even without any direct or indirect identifiers or other clinical information about a patient, a combination of event dates within a longitudinal dataset (such as the date/s when the patient visited a GP, had an x-ray, was admitted to hospital, underwent surgery or was dispensed a medication) can itself render a patient unique in a dataset.
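A minimal sketch of that effect: treat each patient’s set of event dates as a ‘fingerprint’ and check which fingerprints belong to only one patient. The patient labels and dates below are invented.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical events: (pseudonymous patient label, event date) only.
events = [
    ("p1", date(2021, 3, 1)), ("p1", date(2021, 3, 15)),
    ("p2", date(2021, 3, 1)), ("p2", date(2021, 3, 15)),
    ("p3", date(2021, 3, 1)), ("p3", date(2021, 4, 2)), ("p3", date(2021, 6, 9)),
]

def patients_with_unique_date_pattern(events):
    """Return the patients whose set of event dates is shared by no one else."""
    dates_by_patient = defaultdict(set)
    for patient, event_date in events:
        dates_by_patient[patient].add(event_date)
    fingerprints = {
        patient: tuple(sorted(dates))
        for patient, dates in dates_by_patient.items()
    }
    counts = Counter(fingerprints.values())
    return sorted(p for p, fp in fingerprints.items() if counts[fp] == 1)

singled_out = patients_with_unique_date_pattern(events)
print(singled_out)  # ['p3']
```

In real data, with more events per patient over more years, we would expect far more patients to end up with a pattern of dates that is theirs alone.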

Have the most appropriate de-identification techniques been applied?

The report mentions hashing of names, addresses and phone numbers. Given there were patient ID numbers included, why were these details not removed (suppressed) altogether?
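The question matters because some values are easy to guess. The sketch below assumes a plain SHA-256 hash with no salt or secret key; the report does not say how I-MED’s hashing was done, so this is an illustration of the general risk, not a description of their method. The phone number is made up, and the search is limited to 1,000 candidates to keep the example quick.

```python
import hashlib

def sha256_hex(value):
    """Plain, unsalted SHA-256 of a string (an assumption for this sketch)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# A made-up phone number, released only in hashed form.
hashed_phone = sha256_hex("0400000123")

def recover_phone(target_hash, prefix="0400000", digits=3):
    """Try every number starting with `prefix` until one hashes to the target.
    A ten-digit number has at most 10**10 possibilities, which is well
    within reach of ordinary hardware."""
    for n in range(10 ** digits):
        candidate = f"{prefix}{n:0{digits}d}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None

print(recover_phone(hashed_phone))  # 0400000123
```

If a field is not needed for the analysis, suppressing it removes this risk entirely. If it is needed for linkage, a keyed hash with the key withheld from the recipient is a common mitigation.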

Can our de-identification be undone?

Although the report mentions ‘time-shifting’ of event dates (i.e. data perturbation) as one of the de-identification techniques used, we have seen examples where a carefully calibrated de-identification of event dates has been unwound by the inclusion of other dates. For example, if the way a training dataset is being created includes a daily feed of new records, the date of the record upload might effectively give away the ‘obscured’ event date.
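A minimal sketch of that failure mode follows. It assumes a fixed, secret per-patient offset and a feed in which each record is uploaded the day after the event; both are our own assumptions for illustration, and the field names are hypothetical.

```python
from datetime import date, timedelta

true_event = date(2022, 5, 10)  # never released
secret_offset_days = -37        # hypothetical per-patient time shift

# What the recipient sees: a shifted event date plus an unshifted
# upload date stamped by the daily feed (assumed: one day after the event).
released = {
    "shifted_event_date": true_event + timedelta(days=secret_offset_days),
    "upload_date": true_event + timedelta(days=1),
}

def estimate_event_date(record, assumed_lag_days=1):
    """Guess the real event date from the upload date and the known feed lag."""
    return record["upload_date"] - timedelta(days=assumed_lag_days)

estimated = estimate_event_date(released)
recovered_offset = (released["shifted_event_date"] - estimated).days

print(estimated)         # 2022-05-10
print(recovered_offset)  # -37
```

Once the offset is recovered from one record, every other shifted date for the same patient can be corrected too, if the same offset was applied throughout.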

What is the context for the release?

The legal test for de-identification considers not only how the data has been treated, but also the context into which it will be released. OAIC guidance on de-identification from 2018 notes that the judgment must be made with respect to “the relevant release context.”

In other words, the data (post-application of de-identification techniques) must still be considered in context. Release to the world at large is a very different context to a release within a tightly controlled tech environment, to a limited number of people, subject to legal, administrative and technical restrictions.

In that context, the data will only be considered to no longer be ‘personal information’ (and thus, the privacy rules will no longer apply), if identifying an individual – which includes being able to distinguish one individual from the group – “is so impractical that there is almost no likelihood of it occurring.”

While the report notes that I-MED had put in place contractual controls to prohibit Annalise.ai “from doing any act, or engaging in any practice, that would result in the patient data becoming ‘reasonably identifiable’” (including prohibiting “disclosing or publishing the patient data for any purpose”), the report does not explicitly address the ability of the recipients to match the dataset with other publicly available data, such as patient records found on the dark web as a result of past data breaches. Yet this is a matter that should be included in any privacy risk assessment. Contractual controls have their place, but so too do technical controls, such as the use of secure data enclaves which prevent any importing or exporting of other data.

Have we tested for re-identification risk?

Just as you might hire a ‘white hat hacker’ to conduct penetration testing of your information security controls, you need to test that your de-identification approach will stand up to external attack from a motivated intruder.

The OAIC report notes that “I-MED and Annalise.ai provided samples of image scans and other patient data used. A review of these samples by OAIC staff revealed no identifiable personal information.”

But were those OAIC staff re-identification experts? When we conduct PIAs for our clients, we may bring in an additional consultant who is a world-leading expert in conducting re-identification attacks. (It’s a very niche expertise. We don’t pretend to be able to do it ourselves.)

If it’s not de-identified, then what?

What’s the plan if the information can’t be ‘sufficiently de-identified’, without trading off so much data utility as to render the exercise futile?

Entities wishing to use or disclose personal information to train AI must find a way to comply with APP 6 (or their local equivalent), or cease their project. While medical research projects may be able to meet the complex public interest tests necessary to apply a research exception to APP 6, other use cases may not. Your project may be prohibited, in the absence of consent.

Do we have social licence to proceed?

Finally: the I-MED case teaches us a valuable lesson about the importance of meeting community expectations, which may set a higher standard than the law.

According to the ACCC, 83% of Australians agree that “companies should seek user consent before using their data to train AI models.”

Notwithstanding the finding by the Privacy Commissioner that the information was effectively de-identified in context (such that no patient consent was required for the disclosure to occur), based on a review of their online booking process, I-MED now appears to be seeking express consent for the use of patient data from x-rays, even in a de-identified form, to develop models for artificial intelligence.

Upon booking an x-ray recently, I was given the option to tick a box which stated: “I consent to the de-identification of certain personal information (including my age, gender, type of scan, images and report), to design, build and train AI models aimed at enhancing diagnostic accuracy and future patient outcomes by supporting clinical decision making.”

The consent form presented upon booking made clear that this consent request was “optional”. Indeed, I was pleased to discover that it was unbundled from other consent requests, and the default was unticked. Reassuringly, the form also stated up front that this consent request “Will not impact your service…” The form linked to another page which explains “How is AI used to improve care?”

The OAIC’s preliminary inquiries commenced in an environment in which media commentary was focussed on the claim that patients had not been informed or provided their consent to the use of their records to train AI. This shift in business practices to obtain consent, even after I-MED had convinced the regulator of their legal argument that they did not need consent, points to the critical importance not only of legal compliance, but also of ensuring that you have social licence to proceed.

Because, as the Privacy Commissioner warns, “the efficiency and productivity dividends of AI will not be realised if AI tools don’t enjoy the trust and confidence of the Australian public.”

Join us on 20 August to learn more about de-identification in our free webinar: De-identification demystified for GRC, legal and privacy professionals.
