Data Anonymization and Pseudonymization #339

robertmand · 2021-01-18T13:42:34Z

What topic do you wish to add?
This page gives definitions of these terms and suggestions on how to achieve anonymization and pseudonymization of data.

Are there existing pages in the RDM toolkit website related to the requested page?
Pages around human sensitive data and GDPR.

Resources
If there are resources that could be used for writing the new page, please list them below:

Context
If this request is coming from a particular project, domain, or use-case please list them below:
A couple of us wrote this at a previous contentathon in Google Docs and forgot to tell people it was there, so I'm adding it here now.

Here is the text:

Description
Data anonymization is the process of irreversibly modifying personal data in such a way that subjects cannot be identified directly or indirectly by anyone, including the study team. If data are anonymized, no one can link data back to the subject.

Pseudonymization is a process where identifying fields within data records are replaced by artificial identifiers called pseudonyms or pseudonymized IDs. Pseudonymization ensures that no one can link data back to the subject except nominated members of the study team, who are able to link pseudonyms to identifying records such as name and address.

Data anonymization involves modifying a dataset so that it is impossible to identify a subject from their data. Pseudonymization involves replacing identifying data with artificial IDs, for example, replacing a healthcare record ID with an internal participant ID only known to a named clinician working in the study.

Considerations

Both anonymization and pseudonymization are approaches recognized by the GDPR, but they are treated differently: truly anonymized data fall outside its scope, whereas pseudonymized data are still personal data and remain subject to it.
Simply removing identifiers cannot guarantee data anonymity. A dataset may contain unique traits or patterns that could identify individuals. An example would be recording two seemingly unrelated attributes, such as the occurrence of a rare disease and the country of residence, where there is only a single case of this disease in that country.
Data that are anonymous now may not remain anonymous in the future. Future datasets about the same individual may disclose their identity.
Anonymization techniques can sometimes damage the statistical properties of the data, for example when an exact participant age is replaced with an age range.
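
The considerations about unique attribute combinations and age ranges can be illustrated with a minimal Python sketch. The function names (`age_band`, `unique_combinations`), the field names and the sample records are all hypothetical and made up for illustration; this is not a method prescribed by the toolkit, only one way to spot records that are unique on a chosen set of attributes and to generalize an exact age.

```python
from collections import Counter


def age_band(age, width=10):
    """Generalize an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def unique_combinations(records, quasi_identifiers):
    """Return attribute combinations that occur in exactly one record."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [combo for combo, n in counts.items() if n == 1]


# Illustrative, made-up records (assumption: one dict per participant).
records = [
    {"age": 34, "country": "SE", "diagnosis": "common cold"},
    {"age": 37, "country": "SE", "diagnosis": "common cold"},
    {"age": 52, "country": "BE", "diagnosis": "rare disease X"},
]

# A single case of a rare disease in a country stands out even without a name.
risky = unique_combinations(records, ["country", "diagnosis"])

# Replacing exact ages with bands lowers the risk but loses precision.
banded = [dict(r, age=age_band(r["age"])) for r in records]
```

Here `risky` contains the single combination of country "BE" and "rare disease X", and `banded` holds copies of the records with ages such as "30-39". Analyses that need the exact age (a mean, for instance) can no longer be computed from `banded`, which is the trade-off described above.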

Solutions

An example of pseudonymization is where participants in a study are assigned a non-identifying ID and all identifying data (such as name and address) are removed from the metadata to be shared. The mapping of this ID to personal data is held separately and securely by a named researcher, who will not share it.
There are well-established data anonymization approaches, such as k-anonymity, l-diversity, and differential privacy.
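
As a rough sketch of the two ideas above, the following Python code assigns random pseudonyms while keeping the mapping in a separate key table, and then measures the k-anonymity of the shareable records. The function names (`pseudonymize`, `k_anonymity`), the `participant_id` field and the sample participants are assumptions made for illustration. A real study would store the key table in a secured location and would more likely use a dedicated tool such as Amnesia.

```python
import secrets
from collections import Counter


def pseudonymize(records, identifying_fields, id_field="participant_id"):
    """Split records into shareable rows and a separately held key table."""
    shareable, key_table = [], {}
    for record in records:
        # Random pseudonym, so it carries no information about the subject.
        pseudonym = secrets.token_hex(8)
        while pseudonym in key_table:  # extremely unlikely collision
            pseudonym = secrets.token_hex(8)
        key_table[pseudonym] = {f: record[f] for f in identifying_fields}
        shared = {k: v for k, v in record.items() if k not in identifying_fields}
        shared[id_field] = pseudonym
        shareable.append(shared)
    return shareable, key_table


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values(), default=0)


# Illustrative, made-up participants.
participants = [
    {"name": "Alice Example", "address": "1 Main St", "age_band": "30-39", "country": "SE"},
    {"name": "Bob Example", "address": "2 Side St", "age_band": "30-39", "country": "SE"},
    {"name": "Carol Example", "address": "3 High St", "age_band": "50-59", "country": "BE"},
]

# key_table is what the named researcher keeps; shareable is what is released.
shareable, key_table = pseudonymize(participants, ["name", "address"])
k = k_anonymity(shareable, ["age_band", "country"])
```

In this example `k` is 1, because one participant is alone in their age band and country. The data are pseudonymized but would still need further generalization or suppression before they could be considered k-anonymous for any k greater than 1.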

Relevant tools and resources

Amnesia

Thanasis Vergoulis
Robert Andrews

pinarpink · 2021-01-18T15:36:17Z

IMO this content can initially go to the Data Classification page. Perhaps we might amend the page title to 'Data Classification and De-identification'. What say you @bedroesb @floradanna?

floradanna · 2021-01-18T15:52:06Z

Yes, it could make sense. Data Classification so far has only 1 sub-problem (how to figure out if your data are sensitive or not). Maybe a second sub-problem could be "how to achieve anonymization and pseudonymization of sensitive data".

bedroesb · 2021-01-18T16:06:08Z

Do we need a new/different tag?

floradanna · 2021-01-18T16:09:53Z

If the page is the same, I would not use an additional tag. It could complicate things. We'd better make use of keywords in this case.

jmenglund · 2021-01-19T18:29:14Z

I agree with @pinarpink that the Data Classification page is currently the best place for the text. When adding the problem to that page, it is probably a good idea to also take a look at the other problem on that page, "Is my data sensitive?". Some of the bullets under considerations touch upon the same topic.

robertmand added the new page request label Jan 18, 2021

smza self-assigned this Jan 26, 2021

floradanna closed this as completed Mar 17, 2021
