
Duplicate uploads #35

Open · gabor1 opened this issue Jun 27, 2019 · 24 comments
Labels
enhancement New feature or request

Comments

@gabor1
Contributor

gabor1 commented Jun 27, 2019

Do we want to prevent or at least warn about duplicate uploads? E.g. check some hash, and if it's really the same, then at least warn the user before committing the upload?

@gabor1 gabor1 added the enhancement New feature or request label Jun 28, 2019
@gabor1
Contributor Author

gabor1 commented Jun 28, 2019

We can defer this until we have thought more about hashing, which is tied into caching.

@fekad fekad added major new feature and removed enhancement New feature or request labels Jun 28, 2019
@fekad
Contributor

fekad commented Oct 2, 2019

Decision:

  • create a hash value from all the keys (arrays, info, results) and store it as a derived value
  • truncate the float values before hashing (see the sketch below)
  • new command options:
    abcd upload --ignore-duplicates
    abcd upload --remove-duplicates
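
A minimal illustration of the truncation idea, hashing the fixed-precision string representation of a float (the 8-decimal precision and the helper name are only examples, not the implemented behaviour):

import hashlib

def float_digest(x, precision=8):
    # Values that differ only below the chosen precision hash identically.
    return hashlib.md5(f"{x:.{precision}f}".encode()).hexdigest()

a, b = 3.141592653589793, 3.141592653589790   # differ below 8 decimals
assert float_digest(a) == float_digest(b)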
    

@fekad
Contributor

fekad commented Oct 21, 2019

Is it OK to hash the string representation of a float? What should the precision of the truncated float be? Or is there another way to do the truncation more efficiently?

@gabor1 Do you want a structure-specific hash value as well (based only on positions, cell and pbc)?

@gabor1
Contributor Author

gabor1 commented Oct 21, 2019

I think both hashes are interesting: one that is just position/cell/pbc based (remember that many input xyz files don't have the pbc key set), and one that incorporates everything. In the upload functionality, it would be nice to have options that give me warnings if the structure is the same, or if the entire config is the same, and let me take various actions.

@gabor1
Contributor Author

gabor1 commented Oct 21, 2019

The corresponding derived quantities could be "hash_structure" and "hash". The upload options that I think would be useful: --warn-duplicate, --warn-duplicate-structure, --ignore-duplicate.
There should also be a way of calling the upload so that it is not actually executed but the warnings are still given (something like a --dry option?).
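
A rough sketch of what the structure-specific derived quantity could look like, assuming an ASE-style Atoms object (the name hash_structure follows the proposal above; MD5 and the rounding precision are assumptions):

import hashlib
import json
import numpy as np

def hash_structure(atoms, precision=8):
    # Structure-only hash over positions, cell and pbc; rounding keeps
    # numerical noise below the chosen precision from changing the digest.
    payload = {
        "positions": np.round(atoms.get_positions(), precision).tolist(),
        "cell": np.round(np.asarray(atoms.get_cell()), precision).tolist(),
        "pbc": atoms.get_pbc().tolist(),
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.md5(blob).hexdigest()

The full "hash" would be built the same way, with the info/arrays/results keys added to the payload.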

@fekad
Contributor

fekad commented Oct 21, 2019

Unfortunately, Python uses a random seed for its hash function (for security reasons), so the hash value of a string will differ between two Python processes:

print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
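
For what it's worth, CPython's hash randomisation can be switched off by setting the PYTHONHASHSEED environment variable, but relying on the built-in hash() for values stored in a database would still tie them to one interpreter's implementation:

$ python -c "print(hash('hello'))"                    # changes on every run
$ PYTHONHASHSEED=0 python -c "print(hash('hello'))"   # stable across runs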

@gabor1
Contributor Author

gabor1 commented Oct 21, 2019 via email

@gabor1
Contributor Author

gabor1 commented Oct 21, 2019 via email

@gabor1
Contributor Author

gabor1 commented Oct 21, 2019

@fekad
Contributor

fekad commented Oct 21, 2019

I was thinking about MD5 as well, but I hoped there was a more elegant solution. We should not need to use a cryptographic hash like the ones in hashlib; in our case, two close numbers can have close hash values...

Python uses int64 numbers for the hash values:

In [2]: import hashlib

In [3]: type(hash('adsgkjdskgljsdlkjdg4'))
Out[3]: int

In [4]: hash(1), type(hash(1))
Out[4]: (1, int)

In [5]: hashlib.md5(bytes(1)).hexdigest()
Out[5]: '93b885adfe0da089cdf634904fd59f71'

MD5 looks simple and nice, and its ID space is at least as large as the one YouTube uses for all the world's videos.

@gabor1
Contributor Author

gabor1 commented Oct 21, 2019 via email

@fekad
Contributor

fekad commented Oct 21, 2019

As far as I know, it is non-deterministic only for strings; integers and floats are fine.
Here you can see the previously mentioned example run in two separate processes:

In [8]: print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
   ...:
1 3713081631934410656 322818021289917443 -8637518414175570336

In [9]:
Do you really want to exit ([y]/n)? ^D
$ ipython
Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 14:38:56)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: print(hash(1), hash((1,2)), hash(3.14), hash('hello'))
1 3713081631934410656 322818021289917443 -6054905812727825455

@gabor1
Contributor Author

gabor1 commented Oct 21, 2019 via email

@fekad
Contributor

fekad commented Oct 21, 2019

The zlib.adler32() checksum looks interesting, but 32 bits are not enough: you can easily get collisions.
At this level, there is a huge difference between 32- and 64-bit hashes.
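(By the birthday bound, a random b-bit hash is expected to produce its first collision after roughly sqrt(2^b) items: about 65,000 for a 32-bit hash, about 4·10^9 for 64 bits, and about 1.8·10^19 for MD5's 128 bits.)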

Did I miss something, or is the rest just more references to the Python hashlib library?

If I remember right, MD5 is 128 bits long, but then we must convert everything into byte arrays...

Or there may be a completely different, valid solution, like generating a JSON text stream and hashing it with MD5 (of course, the keys must be sorted...).
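
A minimal sketch of that option, assuming all values have already been converted to JSON-serialisable types (numpy arrays to lists, floats truncated):

import hashlib
import json

def document_hash(doc):
    # sort_keys fixes the key order and separators strip whitespace,
    # so identical content always yields the same byte stream.
    blob = json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.md5(blob).hexdigest()

assert document_hash({"a": 1, "b": [1.0, 2.0]}) == document_hash({"b": [1.0, 2.0], "a": 1})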

@fekad
Contributor

fekad commented Oct 21, 2019

This might be interesting:
https://amp.readthedocs.io/en/latest/useamp.html#advanced-use

@gabor1
Contributor Author

gabor1 commented Oct 21, 2019 via email

@fekad
Contributor

fekad commented Oct 22, 2019

The current version of hashing has been implemented using MD5. For the future, I recommend switching to a non-cryptographic hash for better performance, if the issue with the strings can be solved.

About the command options (--warn-duplicate, --warn-duplicate-structure, --ignore-duplicate, --dry): what should the default behaviour of the upload command be?
Depending on the flags, the upload command will raise an error if there are similar structures OR if the full hash is the same.

@gabor1
Contributor Author

gabor1 commented Oct 22, 2019

How hard would it be to count the duplicates all the time and always report them (reporting the number of structural duplicates and of complete duplicates separately), and by default ask a y/n question whether to proceed with the upload? That would mean no --warn* options; but if the --drop-duplicates option is given, then duplicates are NOT uploaded, and if the --upload-duplicates option is explicitly given, then they are uploaded.

I guess if prompting the user is hard, then we can just quit by default, and execute accordingly when one of the above options is given.

@fekad
Contributor

fekad commented Oct 22, 2019

It is not harder than the others, but you pay a performance penalty of about a factor of three: you need to iterate/convert everything twice and check the hashes one by one against the whole database.
The hashes would need to be cached in the DB. I do not have much experience in that field, but maybe you know somebody...

@gabor1
Contributor Author

gabor1 commented Oct 24, 2019

OK, how about this: let's not do questions etc. Default behaviour: check for duplicates, and if there are none, do the upload (a second pass on the file). If there are, report "X duplicate structures found (different metadata). Y duplicate configurations found (including metadata)" and exit without uploading anything. Then the following options can be given (sketched after the list):

  • if --upload-duplicates is given, still report both types of duplicates, but upload them anyway (this can all be done in one pass, but if you prefer two passes for simplicity, that's fine)
  • if --upload-structure-duplicates is given, report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them); still insert structural duplicates that have different metadata
  • if --upload-duplicates-replace is given, upload everything and let duplicates overwrite previously existing data (this is useful for correcting errors); the resulting database should then have no duplicates (neither structural nor exact) if it had none before
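
A rough sketch of that flow, just to pin down the semantics (find_duplicates, db.insert and db.remove are hypothetical helpers, and the duplicate check is assumed to be a first pass over the file):

def upload(configs, db, upload_duplicates=False,
           upload_structure_duplicates=False,
           upload_duplicates_replace=False):
    # First pass: compute hashes and compare against what is already stored.
    exact, structural = find_duplicates(configs, db)
    print(f"{len(structural)} duplicate structures found (different metadata). "
          f"{len(exact)} duplicate configurations found (including metadata).")

    if upload_duplicates_replace:
        # Overwrite: remove stored items with a matching structure hash,
        # then insert the new data, so no duplicates remain afterwards.
        db.remove([hash_structure(c) for c in configs])
        db.insert(configs)
    elif upload_structure_duplicates:
        # Drop EXACT duplicates; keep structural duplicates (new metadata).
        db.insert([c for c in configs if c not in exact])
    elif upload_duplicates or not (exact or structural):
        # Duplicates explicitly allowed, or none found: upload everything.
        db.insert(configs)
    # Otherwise: duplicates found and no flag given -> report and exit.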

Let me know if you see a problem with this plan!

@fekad
Contributor

fekad commented Oct 24, 2019

It looks OK to me. Just to make sure: --upload-duplicates-replace basically means that we remove all the items with a matching hash_structure and add the new one, doesn't it? The new one will get its own new unique id.

@gabor1
Contributor Author

gabor1 commented Oct 24, 2019 via email

@fekad fekad added enhancement New feature or request and removed major new feature labels Nov 1, 2019
@fekad
Contributor

fekad commented Nov 2, 2019

Just to make sure:

  • if --upload-structure-duplicates is given, report both types of duplicates, but do not insert EXACT duplicates into the database (just drop them); still insert structural duplicates that have different metadata

In this case, if there is an EXACT duplicate, the derived properties (username, upload and modified dates, etc.) keep their old, unchanged values: dropping != re-uploading.

  • if --upload-duplicates-replace is given, upload everything and let duplicates overwrite previously existing data (this is useful for correcting errors); the resulting database should then have no duplicates (neither structural nor exact) if it had none before

So if the database already contains some structural duplicates, all of them will be deleted before the new one is uploaded; e.g. 10 calculations with the same structure but different "metadata" will be replaced by the single new one.

What should happen if the file that you want to upload contains more than one calculation with the same structure, or even multiple copies of the same calculation?
If they have not been stored in the DB before, the check will miss them. (It is another factor of N to check them properly; or see the sketch below.)
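
For what it's worth, in-file duplicates could be caught in the same pass by also tracking the hashes already seen in the current batch, which is a set lookup per item rather than another factor of N (config_hash is the hypothetical full-config hash from above):

seen = set()
unique_configs = []
for config in configs:
    h = config_hash(config)
    if h not in seen:       # O(1) set membership instead of rescanning
        seen.add(h)
        unique_configs.append(config)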

fekad pushed a commit that referenced this issue Nov 2, 2019
@gabor1
Contributor Author

gabor1 commented Nov 2, 2019 via email
