Duplicate uploads #35
We can defer to this when we have thought more about hashing, which is tied into caching.

Decision:
Is it OK to hash the string representation of a float? What should the precision of the truncated float be? Or is there another way to do the truncation more efficiently? @gabor1 Do you want a structure-specific hash (which is only based on …
I think both hashes are interesting: one that is just position/cell/pbc based (remember that many input xyz files don't have the pbc key set), and also one that incorporates everything. In the upload functionality, it would be nice to have options that allow me to get warnings if the structure is the same, or if the entire config is the same, and to take various actions.
The corresponding derived quantities could be "hash_structure" and "hash". The upload options that I think would be useful: --warn-duplicate, --warn-duplicate-structure, --ignore-duplicate.
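A minimal sketch of what the two derived quantities could look like. Everything here is an illustrative assumption, not the project's real API: the function names, the `precision` default, and the use of `repr` plus MD5 are all placeholders for whatever hashing scheme is eventually chosen.

```python
import hashlib


def hash_structure(positions, cell, pbc, precision=10):
    """Hypothetical structure-only hash based on positions/cell/pbc.

    Floats are rounded to a fixed number of decimals before hashing, so
    values that differ only by tiny numerical noise map to the same digest.
    """
    payload = repr((
        [tuple(round(float(x), precision) for x in row) for row in positions],
        [tuple(round(float(x), precision) for x in row) for row in cell],
        tuple(bool(p) for p in pbc),
    ))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def hash_full(positions, cell, pbc, metadata, precision=10):
    """Hypothetical full hash: the structure hash combined with metadata.

    Metadata items are sorted so the digest does not depend on dict
    insertion order.
    """
    payload = hash_structure(positions, cell, pbc, precision)
    payload += repr(sorted(metadata.items()))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Two configs with the same rounded structure then share `hash_structure` even when their metadata (and hence `hash_full`) differs.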
Unfortunately Python - for security reasons - uses a random seed for the hash function, so the hash value of a string will be different between two Python processes:

print(hash(1), hash((1,2)), hash(3.14), hash('hello'))

um… what's the point of a hash if it gives different values for the same thing… there must be a way to use it so that it gives the same value

-- Gábor

Gábor Csányi
Professor of Molecular Modelling
Engineering Laboratory
Pembroke College
University of Cambridge
Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/
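For reference, the randomization described above (enabled by default since Python 3.3) can be pinned via the `PYTHONHASHSEED` environment variable, which is one way to make `hash()` of a string reproducible across processes. A small demonstration using fresh subprocesses:

```python
import os
import subprocess
import sys


def str_hash_in_fresh_process(seed):
    """Run hash('hello') in a brand-new interpreter with a pinned seed."""
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('hello'))"],
        env=env, capture_output=True, text=True,
    )
    return out.stdout.strip()


# Same pinned seed: the string hash is stable across processes.
assert str_hash_in_fresh_process("0") == str_hash_in_fresh_process("0")
# Different seeds: the values (almost certainly) differ.
assert str_hash_in_fresh_process("0") != str_hash_in_fresh_process("1")
```

Note that `PYTHONHASHSEED=0` disables randomization entirely, which also disables its security benefit, so it is a workaround rather than a fix.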
maybe you want an md5 hash?
-- Gábor
I was thinking about md5 as well, but I hoped there was a more elegant solution. We should not need to use a cryptographic hash like the ones in hashlib; in our case, two close numbers can have close hash values... Python uses int64 numbers for the hash values:

In [3]: type(hash('adsgkjdskgljsdlkjdg4'))
Out[3]: int
In [4]: hash(1), type(hash(1))
Out[4]: (1, int)
In [5]: hashlib.md5(bytes(1)).hexdigest()
Out[5]: '93b885adfe0da089cdf634904fd59f71'

It looks simple and nice, and it has almost the same number of possibilities that YouTube uses for all the world's videos.

I agree it doesn't need to be cryptographic. But didn't you say that the hash() function is not deterministic?

-- Gábor
As far as I know, it is non-deterministic only for strings; integers and floats are OK. Here you can see the previously posted example:

In [8]: print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
1 3713081631934410656 322818021289917443 -8637518414175570336

$ ipython
Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
1 3713081631934410656 322818021289917443 -6054905812727825455

I found this:
https://stackoverflow.com/questions/27954892/deterministic-hashing-in-python-3
They say use a different hash... maybe there are cheaper ones than md5.

-- Gábor
Did I miss something? Because the rest is just the same reference to Python's hash(). If I remember right, md5 is 128 bits long, but we must convert everything into byte arrays... Or it is possible that there is a completely different valid solution, like generating a JSON text stream and hashing that one with md5 (of course, the keys must be ordered...).
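The JSON idea above can be sketched in a few lines. `config_hash` is a hypothetical name; the key points are `sort_keys=True` (fixing the key order, as noted above) and compact separators, so equal data always serializes to identical bytes regardless of dict insertion order:

```python
import hashlib
import json


def config_hash(config):
    """Hash a configuration via a canonical JSON byte stream.

    sort_keys=True fixes the key order and the compact separators strip
    whitespace, so the same data always produces the same byte stream
    (and therefore the same digest).
    """
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

One caveat that ties back to the precision discussion: floats would still need to be rounded/truncated before serialization, since `json.dumps` preserves their full repr.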
This might be interesting:
https://amp.readthedocs.io/en/latest/useamp.html#advanced-use

maybe stick with md5 then

-- Gábor
The current version of hashing has been implemented using MD5. For the future, I recommend switching to a non-cryptographic hash for better performance, if the issue with the strings can be solved. About the command options (--warn-duplicate, --warn-duplicate-structure, --ignore-duplicate, --dry): what should the default behaviour of the upload command be?
How hard would it be to count the duplicates all the time and always report them (separately reporting the number of structural duplicates and complete duplicates), and by default ask a y/n question whether to proceed with the upload or not? This means no --warn* options; but if the --drop-duplicates option is given, then duplicates are NOT uploaded, and if the --upload-duplicates option is explicitly given, then they are uploaded. I guess if questioning the user is hard, then we can just quit, and if the above options are given, then we execute accordingly.
It is not harder than the others. Of course, you have to pay a performance penalty of about a factor of 3: you need to iterate/convert everything twice and check the hashes one by one against the whole database.
ok, how about this. Let's not do questions etc. Default behaviour: check for duplicates, and if there are none, then do the upload (second pass on the file). If there are, then report "X duplicate structures found (different metadata). Y duplicate configurations found (including metadata)", and exit without uploading anything. Then the following options can be given:

- if --upload-structure-duplicates is given, then report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them), but still insert structural duplicates that have different metadata
- if --upload-duplicates-replace is given, then upload everything and duplicates overwrite previously existing data (this is useful for correcting errors), so the resulting database should have no duplicates (neither structural nor exact) if it didn't have any before

let me know if you see a problem with this plan!
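The decision logic above can be sketched as follows. All names are illustrative assumptions (the real CLI and data model may differ); `configs` are dicts that already carry the two derived hashes, and the `db_*` arguments stand in for lookups against the database:

```python
def plan_upload(configs, db_hashes, db_structure_hashes,
                upload_structure_duplicates=False,
                upload_duplicates_replace=False):
    """Decide which configs to upload, per the plan above (sketch only)."""
    exact = [c for c in configs if c["hash"] in db_hashes]
    structural = [c for c in configs
                  if c["hash_structure"] in db_structure_hashes
                  and c["hash"] not in db_hashes]

    if upload_duplicates_replace:
        # Upload everything; duplicates overwrite previously existing data.
        return configs
    if upload_structure_duplicates:
        # Drop EXACT duplicates, keep structural duplicates with new metadata.
        return [c for c in configs if c["hash"] not in db_hashes]
    if exact or structural:
        # Default: report both counts and exit without uploading anything.
        print(f"{len(structural)} duplicate structures found (different metadata). "
              f"{len(exact)} duplicate configurations found (including metadata)")
        return []
    return configs
```

The two-pass nature of the plan (count first, upload second) maps onto calling this once to decide, then iterating the file again to insert.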
It looks ok to me. Just to make sure: --upload-duplicates-replace basically means that we remove all the items with a matching hash_structure and add the new one, doesn't it? The new one will get its own new unique id.

yes

-- Gábor
Just to make sure:

> if --upload-structure-duplicates is given, then report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them), but still insert structural duplicates that have different metadata

In this case, if there is an EXACT duplicate, the derived properties (username, upload and modified date, etc.) will keep the old, unchanged values. Dropping != re-uploading.

> if --upload-duplicates-replace is given, then upload everything and duplicates overwrite previously existing data

So if the database already contained some structural duplicates, all of them will be deleted before uploading the new one. E.g. 10 calculations with the same structure but different "metadata" will be replaced by the new one.

What should happen if the file that you want to upload contains more than one calculation with the same structure, or even multiples of the same calculation? If they have not been stored in the DB before, the check will miss them. (It is another factor of N to check them properly.)

Good point. I think we just upload the new file, and if it has duplicates, fine, they will go into the database.

-- Gábor
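For the record, catching duplicates *within* the uploaded file does not need the extra factor of N mentioned above: a set of hashes seen so far catches them in a single O(N) pass. A hypothetical sketch (names are illustrative):

```python
def drop_in_file_duplicates(configs):
    """Remove EXACT duplicates within one upload, keeping first occurrences.

    Membership tests against a set are O(1), so the whole pass is O(N)
    rather than the naive O(N^2) pairwise comparison.
    """
    seen = set()
    unique = []
    for c in configs:
        if c["hash"] not in seen:
            seen.add(c["hash"])
            unique.append(c)
    return unique
```

The same idea works for `hash_structure` if structural duplicates inside one file should also be detected.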
Do we want to prevent or at least warn about duplicate uploads? E.g. check some hash, and if it's really the same, then at least warn the user before committing the upload?