Duplicate uploads #35

gabor1 · 2019-06-27T20:37:32Z

Do we want to prevent or at least warn about duplicate uploads? E.g. check some hash, and if it's really the same, then at least warn the user before committing the upload?

gabor1 · 2019-06-28T18:00:41Z

We can defer to this when we have thought more about hashing, which is tied into caching

fekad · 2019-10-02T00:04:24Z

Decision:

creating a hash value from all the keys (array, info, results) and store it as derived value
the float values should be truncated

new command options:

abcd upload --ignore-duplicates 
abcd upload --remove_duplicates

fekad · 2019-10-21T12:20:28Z

Is it ok the hash of the string representation of a float? What should be the precision of the truncated float? Or is there any other way to do the truncation more efficiently?

@gabor1 Do you want a structure-specific (which only based on position, cell and pbc) hash value as well?

gabor1 · 2019-10-21T12:25:51Z

I think both hashes are interesting, one that is just position/cell/pbc based (remember that many input xyzzy don't have the pic key set), and also one that incorporates everything. in the upload functionality, it would be nice to have options that allow me to get warnings if the structure is the same, or if the entire config is the same, and to take various actions

gabor1 · 2019-10-21T12:28:20Z

the corresponding derived quantities could be "hash_structure" and "hash". the upload options that I think would be useful: --warn-duplicate, --warn-duplicate structure, --ignore-duplicate
there should be a way of calling the upload so that it is not actually executed, but the warnings are given (something like a --dry option?)

fekad · 2019-10-21T13:30:27Z

Unfortunately python - for security reasons - uses a random seed for the hash function. So a hash value of a string will be different between two python processes

print(hash(1), hash((1,2)), hash(3.14), hash('hello'))

gabor1 · 2019-10-21T13:31:25Z

um… what’s the point of a hash if it gives different values for the same thing… there must be a way to use ot so that it gives the same value

…

-- Gábor Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

On 21 Oct 2019, at 14:30, Adam Fekete ***@***.***> wrote: Unfortunately python - for security reasons - uses a random seed for the hash function. So a hash value of a string will be different between two python processes print(hash(1), hash((1,2)), hash(3.14), hash('hello')) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

gabor1 · 2019-10-21T13:31:41Z

maybe you want an md5 hash?

…

-- Gábor Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

On 21 Oct 2019, at 14:31, Gabor Csanyi ***@***.***> wrote: um… what’s the point of a hash if it gives different values for the same thing… there must be a way to use ot so that it gives the same value -- Gábor Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/ > On 21 Oct 2019, at 14:30, Adam Fekete ***@***.***> wrote: > > Unfortunately python - for security reasons - uses a random seed for the hash function. So a hash value of a string will be different between two python processes > > print(hash(1), hash((1,2)), hash(3.14), hash('hello')) > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub, or unsubscribe. >

gabor1 · 2019-10-21T13:32:45Z

https://docs.python.org/2/library/hashlib.html#module-hashlib

fekad · 2019-10-21T22:16:18Z

I was thinking about md5 as well but I hoped there is a more elegant solution. We should not need to use a cryptographic hash like the ones in hashlib. In our case, two close number can have close hash values...

Python uses int64 numbers for the hash values:

Out[3]: type(hash('adsgkjdskgljsdlkjdg4'))
Out[3]: int

In [4]: hash(1), type(hash(1))
Out[4]: (1, int)

In [5]: hashlib.md5(bytes(1)).hexdigest()
Out[5]: '93b885adfe0da089cdf634904fd59f71'

It looks simple and nice and it has almost the same amount of possibilities that youtube uses for the whole world's videos.

gabor1 · 2019-10-21T22:19:20Z

I agree it doesn't need to be cryptographic. But didn't you say that the hash() function is not deterministic?

…

-- Gábor

On 21 Oct 2019, at 23:16, Adam Fekete ***@***.***> wrote: I was thinking about md5 as well but I hoped there is a more elegant solution. We should not need to use a cryptographic hash like the ones in hashlib. In our case, two close number can have close hash values... Python uses int64 numbers for the hash values: Out[3]: type(hash('adsgkjdskgljsdlkjdg4')) Out[3]: int In [4]: hash(1), type(hash(1)) Out[4]: (1, int) In [5]: hashlib.md5(bytes(1)).hexdigest() Out[5]: '93b885adfe0da089cdf634904fd59f71' It looks simple and nice and it has almost the same amount of possibilities that youtube uses for the whole world's videos. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

fekad · 2019-10-21T22:25:35Z

As far as I know, it is not deterministic only for the strings. integer floats are ok.
Here you can see the previously commented example:

In [8]: print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
   ...:
1 3713081631934410656 322818021289917443 -8637518414175570336

In [9]:
Do you really want to exit ([y]/n)? ^D
$ ipython
Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 14:38:56)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
1 3713081631934410656 322818021289917443 -6054905812727825455

gabor1 · 2019-10-21T22:42:24Z

i found this https://stackoverflow.com/questions/27954892/deterministic-hashing-in-python-3 they say use a different hash... maybe there are cheaper ones than md5

…

-- Gábor

On 21 Oct 2019, at 23:25, Adam Fekete ***@***.***> wrote: As far as I know, it is not deterministic only for the strings. integer floats are ok. Here you can see the previously commented example: In [8]: print(hash(1), hash((1,2)), hash(3.14), hash('hello')) ...: 1 3713081631934410656 322818021289917443 -8637518414175570336 In [9]: Do you really want to exit ([y]/n)? ^D $ ipython Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56) Type 'copyright', 'credits' or 'license' for more information IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: print(hash(1), hash((1,2)), hash(3.14), hash('hello')) 1 3713081631934410656 322818021289917443 -6054905812727825455 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

fekad · 2019-10-21T23:03:49Z

The zlib.adler32() looks interesting bit 32 bit is not enough. You can easily have conflicts.
At this level, there is a huge difference between 32- and 64-bit hashes.

Did I miss something because the rest is just the same reference to python hashlib library?

If I remember right md5 is 128 bit long but we must convert everything into byte arrays...

Or it is possible that there is a completely different valid solution like generating a JSON text stream and hashing that one by md5. (of course, the keys must be ordered ...)

fekad · 2019-10-21T23:10:21Z

This might be interesting:
https://amp.readthedocs.io/en/latest/useamp.html#advanced-use

gabor1 · 2019-10-21T23:32:32Z

maybe stick with md5 then

…

-- Gábor

On 22 Oct 2019, at 00:10, Adam Fekete ***@***.***> wrote: This might be interesting: https://amp.readthedocs.io/en/latest/useamp.html#advanced-use — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

fekad · 2019-10-22T15:21:46Z

The current version of hashing has been implemented using MD5. For the future, I recommend using a non-cryptographic hash to get better performance if it is possible to solve the issue with the strings.

About the command options (--warn-duplicate, --warn-duplicate structure, --ignore-duplicate, --dry), what should be the default behaviour for the upload command?
According to the flags, the upload command will raise an error if there are similar structures OR when the full hash is the same.

gabor1 · 2019-10-22T17:55:03Z

how hard would it be to count the duplicates all the time, and always report them (separately report number of structural duplicates and complete duplicates), and by default as a y/n question whether to proceed with the upload or not? This means no --warn* options, but if the --drop-duplicates option is given, then duplicates are NOT uploaded, and if the --upload-duplicates option is explicitly given, then they are uploaded ?

I guess if questioning the user is hard, then we can just quit, and if the above options are given, then we execute accordingly

fekad · 2019-10-22T19:57:17Z

It is not harder than the others. Of course, you have to pay a performance penalty about a factor of 3. You need to iterate/convert everything twice and checking the hashes one by one on the whole database.
The hashes need to be cached in the DB. I do not have too much experience in that field but maybe you know somebody...

gabor1 · 2019-10-24T12:52:16Z

ok, how about this. let's not do questions etc. default behaviour: check for duplicates, and if there are none, then do the upload (second pass on the file). If there are, then report "X duplicate structures found (different metadata). Y duplicate configurations found (including metadata)", and exit without uploading anything. Then the following options can be given.

if --upload-duplicates is given, you still report both types of duplicates, but still upload them (this can all be done in one pass, but if you prefer to do two for simplicity, that's fine)
if --upload-structure-duplicates is given, then report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them), but still insert structural duplicates that have different metadata
if --upload-duplicates-replace is given, then upload everything and duplicates overwrite previously existing data (this is useful for correcting errors), so resulting database should have no duplicates (neither structural nor exact) if it didn't have any before

let me know if you see a problem with this plan!

fekad · 2019-10-24T16:10:25Z

It looks ok for me. Just to make sure the --upload-duplicates-replace basically means that we remove all the items with matching hash_structure and add the new one, isn't? The new one will get its own new unique id.

gabor1 · 2019-10-24T17:11:47Z

yes

…

-- Gábor Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

On 24 Oct 2019, at 17:10, Adam Fekete ***@***.***> wrote: It looks ok for me. Just to make sure the --upload-duplicates-replace basically means that we remove all the items with matching hash_structure and add the new one, isn't? The new one will get its own new unique id. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

fekad · 2019-11-02T03:14:33Z

Just to make use:

if --upload-structure-duplicates is given, then report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them), but still insert structural duplicates that have different metadata

in this case, if there is an EXACT duplicate the derived properties (username, upload and modified date, etc...) will stay the old unchanged values. dropping != reuploading.

if --upload-duplicates-replace is given, then upload everything and duplicates overwrite previously existing data (this is useful for correcting errors), so resulting database should have no duplicates (neither structural nor exact) if it didn't have any before

So if the database already contained some structural duplicates all of them will be deleted before uploading the new one. eg: 10 calculation with the same structure but different "metadata" will be replaced by the new one.

What should happen if the file that you want to upload contains more than one calculation with the same structure or even multiples of the same calculation?
If they not have been stored in the DB before the check will miss them. (It is another factor of N to check them properly)

gabor1 · 2019-11-02T09:01:24Z

Good point. I think we just upload the new file and if it has duplicates, fine, they will go into the database

…

-- Gábor

On 2 Nov 2019, at 03:14, Adam Fekete ***@***.***> wrote: Just to make use: if --upload-structure-duplicates is given, then report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them), but still insert structural duplicates that have different metadata in this case, if there is an EXACT duplicate the derived properties (username, upload and modified date, etc...) will stay the old unchanged values. dropping != reuploading. if --upload-duplicates-replace is given, then upload everything and duplicates overwrite previously existing data (this is useful for correcting errors), so resulting database should have no duplicates (neither structural nor exact) if it didn't have any before So if the database already contained some structural duplicates all of them will be deleted before uploading the new one. eg: 10 calculation with the same structure but different "metadata" will be replaced by the new one. What should happen if the file that you want to upload contains more than one calculation with the same structure or even multiples of the same calculation? If they not have been stored in the DB before the check will miss them. (It is another factor of N to check them properly) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

gabor1 added the enhancement New feature or request label Jun 28, 2019

fekad added major new feature and removed enhancement New feature or request labels Jun 28, 2019

fekad pushed a commit that referenced this issue Oct 22, 2019

implement hash function (string-based MD5) (#35)

9608056

fekad added enhancement New feature or request and removed major new feature labels Nov 1, 2019

fekad pushed a commit that referenced this issue Nov 2, 2019

adding upload options (#35)

982f580

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate uploads #35

Duplicate uploads #35

gabor1 commented Jun 27, 2019

gabor1 commented Jun 28, 2019

fekad commented Oct 2, 2019

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019

gabor1 commented Oct 21, 2019

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019 via email

gabor1 commented Oct 21, 2019 via email

gabor1 commented Oct 21, 2019

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019 via email

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019 via email

fekad commented Oct 21, 2019

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019 via email

fekad commented Oct 22, 2019

gabor1 commented Oct 22, 2019

fekad commented Oct 22, 2019

gabor1 commented Oct 24, 2019

fekad commented Oct 24, 2019

gabor1 commented Oct 24, 2019 via email

fekad commented Nov 2, 2019

gabor1 commented Nov 2, 2019 via email

Duplicate uploads #35

Duplicate uploads #35

Comments

gabor1 commented Jun 27, 2019

gabor1 commented Jun 28, 2019

fekad commented Oct 2, 2019

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019

gabor1 commented Oct 21, 2019

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019 via email

gabor1 commented Oct 21, 2019 via email

gabor1 commented Oct 21, 2019

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019 via email

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019 via email

fekad commented Oct 21, 2019

fekad commented Oct 21, 2019

gabor1 commented Oct 21, 2019 via email

fekad commented Oct 22, 2019

gabor1 commented Oct 22, 2019

fekad commented Oct 22, 2019

gabor1 commented Oct 24, 2019

fekad commented Oct 24, 2019

gabor1 commented Oct 24, 2019 via email

fekad commented Nov 2, 2019

gabor1 commented Nov 2, 2019 via email