-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplicate identifiers #2896
Deduplicate identifiers #2896
Conversation
…must be unique for the article's journal.
Closes #2848 |
src/identifiers/forms.py
Outdated
id_type=id_type, | ||
identifier=identifier, | ||
).exclude( | ||
article=self.article, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this would allow for any given article to still have duplicate identifiers. A way to avoid this would be to add an exclude on id=identifier.id
to_keep_pk = identifiers_of_type.filter(identifier=doi_string).values_list('id', flat=True)[0] | ||
duplicate_pks = identifiers_of_type.filter(identifier=doi_string).values_list('id', flat=True)[1:] | ||
duplicate_identifiers = [identifier for identifier in Identifier.objects.filter(pk__in=duplicate_pks)] | ||
print(id_type, to_keep_pk, doi_string) | ||
if duplicate_identifiers: | ||
print('\n\n\n') | ||
print('To keep:') | ||
print(id_type, to_keep_pk, doi_string) | ||
print('Duplicates:') | ||
for dup in duplicate_identifiers: | ||
print(id_type, dup.pk, dup.identifier) | ||
print('\n\n\n') | ||
Identifier.objects.filter(pk__in=duplicate_pks).delete() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need all these printing.
Also, we are keeping the first identifier for each duplicate... makes me wonder if this is a good approach, since we won't be able to detect which of these are mistaken. I would advocate for deleting duplicate IDs for a given article but refrain from doing this for duplicates across an entire journal in case we wipe a correct identifier while preserving the wrong one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more change
src/identifiers/forms.py
Outdated
).exclude( | ||
article=self.article, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make it so that identifier edits won't be valid (since the query will return the identifier being edited)
A solution to this would be:
if self.instance and self.instance.id:
idents = idents.exclude(id=self.instance.id)
We first tried to solve this with a unique constraint like this:
But this would not account for identifiers that are the same for several journals, which tend to occur for imported content.
So we tried adding the journal to the unique constraint. But
journal
is not an attribute ofIdentifier
, and the unique constraints can't handle chaining foreign keys together as witharticle__journal
.So we set out to define a custom
validate_unique
function, but this is quite complex as the default function is quite complex.At this point the only completed work that we will likely want to keep is the migration that will find and remove duplicates in previous data.