Data Anonymization and Pseudonymization #339

Closed
robertmand opened this issue Jan 18, 2021 · 5 comments

@robertmand
Contributor

robertmand commented Jan 18, 2021

What topic do you wish to add?
This page gives definitions of these terms and suggestions on how to achieve anonymization and pseudonymization of data.

Are there existing pages in the RDM toolkit website related to the requested page?
Pages around human sensitive data and GDPR.

Resources
If there are resources that could be utilised for writing the new page, please list them below:

Context
If this request is coming from a particular project, domain, or use-case please list them below:
A couple of us wrote this at a previous contentathon in Google Docs and forgot to tell people it was there, so I'm putting it in now.

Here is the text:

Description
Data anonymization is the process of irreversibly modifying personal data in such a way that subjects cannot be identified directly or indirectly by anyone, including the study team. If data are anonymized, no one can link data back to the subject.

Pseudonymization is a process where identifying fields within data records are replaced by artificial identifiers called pseudonyms or pseudonymized IDs. Pseudonymization ensures that no one can link data back to the subject except nominated members of the study team, who are able to link pseudonyms to identifying records, such as name and address.

Data anonymization involves modifying a dataset so that it is impossible to identify a subject from their data. Pseudonymization involves replacing identifying data with artificial IDs, for example, replacing a healthcare record ID with an internal participant ID only known to a named clinician working in the study.
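
As a rough illustration of the pseudonymization described above, here is a minimal Python sketch. The field names (`record_id`, `name`, `address`), the `pseudonymize` helper, and the `P-` prefix are hypothetical and chosen only for this example; how and where the key table is stored would depend on the study's own procedures.

```python
import secrets

# Hypothetical identifying fields, used only for illustration.
IDENTIFYING_FIELDS = ("record_id", "name", "address")


def pseudonymize(records):
    """Split records into shareable rows and a separately held key table."""
    shared_rows = []
    key_table = {}  # pseudonym -> identifying fields; keep secure, never share
    for record in records:
        # Random pseudonym, not derived from any value in the record.
        pseudonym = "P-" + secrets.token_hex(8)
        while pseudonym in key_table:  # guard against an (unlikely) collision
            pseudonym = "P-" + secrets.token_hex(8)
        key_table[pseudonym] = {f: record[f] for f in IDENTIFYING_FIELDS}
        # Copy everything except the identifying fields into the shared row.
        row = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        row["participant_id"] = pseudonym
        shared_rows.append(row)
    return shared_rows, key_table


records = [
    {"record_id": "H-1001", "name": "Alice Example",
     "address": "1 Main St", "diagnosis": "asthma"},
    {"record_id": "H-1002", "name": "Bob Example",
     "address": "2 Side St", "diagnosis": "diabetes"},
]
shared_rows, key_table = pseudonymize(records)
```

Only `shared_rows` would leave the study team. The `key_table` is what allows a nominated person to re-identify a participant, so it is the part that needs to be held separately and securely.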

Considerations

  • Both anonymization and pseudonymization are recognized by the GDPR, but with different consequences: anonymized data are no longer personal data and fall outside the regulation, whereas pseudonymized data remain personal data and are still subject to it.
  • Simply removing identifiers cannot guarantee data anonymity. A dataset may contain unique traits or patterns that could identify individuals. An example would be recording two seemingly unrelated attributes, such as an instance of a rare disease and the country of residence, where there is only a single case of this disease in that country.
  • Data that are anonymous now may not remain anonymous in the future. Future datasets on the same individual may disclose their identity.
  • Anonymization techniques can sometimes damage the statistical properties of the data. For example, replacing a participant's exact age with an age range reduces the resolution available for analysis.
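
Two of the considerations above can be sketched in a few lines of Python: spotting attribute combinations that single out one person, and generalizing an exact age into a range. The records are made up, and the attribute names and the helpers `unique_combinations` and `age_band` are assumptions for this example rather than part of any particular tool.

```python
from collections import Counter

# Made-up records; attribute names are assumptions for this example.
records = [
    {"age": 34, "country": "SE", "disease": "common cold"},
    {"age": 37, "country": "SE", "disease": "common cold"},
    {"age": 52, "country": "SE", "disease": "rare disease X"},
    {"age": 58, "country": "NO", "disease": "common cold"},
    {"age": 51, "country": "NO", "disease": "common cold"},
]


def unique_combinations(rows, attributes):
    """Return the attribute combinations that occur exactly once."""
    counts = Counter(tuple(row[a] for a in attributes) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


def age_band(age, width=10):
    """Generalize an exact age to a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Combinations that point at a single record even without a name.
singled_out = unique_combinations(records, ("country", "disease"))

# Same records with the exact age replaced by a ten-year band.
banded = [{**row, "age": age_band(row["age"])} for row in records]
```

Here `singled_out` contains the one rare-disease case, even though no direct identifier is present. After banding, the ages 51, 52 and 58 all fall into the same range, which makes individuals harder to tell apart but also means statistics such as the mean age can no longer be computed exactly.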

Solutions

  • An example of pseudonymization is where participants in a study are assigned a non-identifying ID and all identifying data (such as name and address) are removed from the metadata to be shared. The mapping of this ID to personal data is held separately and securely by a named researcher who will not share this mapping.
  • There are well-established data anonymization approaches, such as k-anonymity, l-diversity, and differential privacy.
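
To make two of the approaches named above more concrete, here is a small Python sketch that measures k-anonymity and the simplest ("distinct") form of l-diversity for a table. The column names, the sample rows, and the helper functions are hypothetical; differential privacy is not covered by this sketch.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group: each record is hidden among at least k."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


# Made-up, already generalized rows; column names are assumptions.
rows = [
    {"age_band": "30-39", "country": "SE", "diagnosis": "asthma"},
    {"age_band": "30-39", "country": "SE", "diagnosis": "diabetes"},
    {"age_band": "50-59", "country": "NO", "diagnosis": "asthma"},
    {"age_band": "50-59", "country": "NO", "diagnosis": "asthma"},
]

k = k_anonymity(rows, ("age_band", "country"))
l = l_diversity(rows, ("age_band", "country"), "diagnosis")
```

In this toy table `k` is 2 but `l` is 1: every record shares its age band and country with one other record, yet everyone in the second group has the same diagnosis, so knowing someone is in that group already reveals it. This is the kind of gap that l-diversity is meant to expose on top of k-anonymity.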

Relevant tools and resources

  • Amnesia

Thanasis Vergoulis
Robert Andrews

@pinarpink
Contributor

IMO this content can initially go to the Data Classification page. Perhaps we might amend the page title to 'Data Classification and De-identification'. What do you think, @bedroesb @floradanna?

@floradanna
Collaborator

Yes, it could make sense. Data Classification so far has only one sub-problem (how to figure out if your data are sensitive or not). Maybe a second sub-problem could be "how to achieve anonymization and pseudonymization of sensitive data".

@bedroesb
Member

Do we need a new or different tag?

@floradanna
Collaborator

If the page is the same, I would not use an additional tag. It could complicate things. We'd better make use of keywords in this case.

@jmenglund
Contributor

I agree with @pinarpink that the Data Classification page is currently the best place for the text. When adding the problem to that page, it is probably a good idea to also take a look at the other problem on that page, "Is my data sensitive?". Some of the bullets under considerations touch upon the same topic.

@smza self-assigned this Jan 26, 2021