Write checksum to filecache table #13397
Conversation
Signed-off-by: Tomasz Grobelny <[email protected]>
Ref #11138
A naive approach like this will be terribly inefficient when dealing with large or remote files, as the file will have to be downloaded from the remote again just to calculate the checksum.
The only efficient way to get the checksum is to re-use the upload stream and calculate the checksum from that.
Also, we should make it possible to use different algorithms, or at least encode the algorithm into the checksum to be able to change it in the future.
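For illustration, a minimal sketch of the stream-based idea (the helper name is invented, not the actual Nextcloud code): the checksum is updated chunk by chunk while the upload stream is forwarded to storage, so the file is never read a second time.

```php
<?php
// Sketch only: compute the checksum while copying the upload stream to storage,
// instead of re-reading (or re-downloading) the file afterwards.
function writeWithChecksum($sourceStream, $targetStream, string $algo = 'md5'): string {
    $ctx = hash_init($algo);
    while (!feof($sourceStream)) {
        $chunk = fread($sourceStream, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        hash_update($ctx, $chunk);      // update the running checksum
        fwrite($targetStream, $chunk);  // forward the same bytes to the target storage
    }
    // Encode the algorithm into the stored value so it can be changed later.
    return $algo . ':' . hash_final($ctx);
}
```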
Signed-off-by: Tomasz Grobelny <[email protected]>
@icewind1991 How about this approach (see my new commit)? Now it calculates the md5 sum on the fly while uploading the file. The implementation is obviously not yet complete (e.g. I need to work on chunked upload and restructure the code), but I wanted to ask if the approach would be acceptable?
Just to make it clear: something like "md5:290dbe614e79b16b6204975614939c8e" would be OK?
Yep - something like that is what I had in mind.
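As a toy illustration of that "algorithm:hex" format (the file path and values are invented for the example), the stored value can be split back into the algorithm and the digest when verifying:

```php
<?php
// Hypothetical example: parse an "<algo>:<hex>" checksum and verify a local file against it.
$stored = 'md5:290dbe614e79b16b6204975614939c8e';

[$algo, $expected] = explode(':', $stored, 2);
$actual = hash_file($algo, '/tmp/example.txt');  // example path, not a real Nextcloud location

if ($actual !== false) {
    var_dump($algo, hash_equals($expected, $actual));
}
```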
The method of calculating the hash is good. Keep in mind that there are apps that use the lower-level APIs to write file data, so any logic that generates checksums will have to ensure that the checksum is either re-calculated or invalidated on every write. This is one of the main reasons why this hasn't been added yet: if we end up missing a single code path that writes to the file and forgets to update the checksum, it'll lead to Nextcloud thinking the file is corrupted.
Can you suggest any paths/apps that should be checked? What comes to my mind is file copy/move in the files app, upload, chunked upload and editing a txt file in the text editor app. Anything more?
Well, for now nothing actually uses those checksums, so a wrong checksum (calculated again on download) could be treated as an internal warning, and the number of such warnings could then be counted as part of usage statistics. Would that make sense?
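To make the write-path concern above concrete, here is a rough sketch (the interface and helper names are invented, not the real Nextcloud cache API): every code path that writes file data either stores a freshly calculated checksum or clears the stored one, so a stale value never survives a write.

```php
<?php
// Hypothetical sketch: every write path either stores a freshly computed
// checksum or explicitly invalidates the old one; it never leaves it untouched.
interface ChecksumCache {
    public function setChecksum(int $fileId, ?string $checksum): void;
}

function onFileWritten(ChecksumCache $cache, int $fileId, ?string $freshChecksum = null): void {
    if ($freshChecksum !== null) {
        $cache->setChecksum($fileId, $freshChecksum);  // recalculated during the write
    } else {
        $cache->setChecksum($fileId, null);            // unknown: invalidate, don't guess
    }
}
```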
Signed-off-by: Tomasz Grobelny <[email protected]>
Signed-off-by: Tomasz Grobelny <[email protected]>
Now the code is somewhat cleaner and supports a prefix. While writing the code I started wondering whether having up-to-date checksums is viable at all, given that files can be stored externally (SMB, DAV, etc.), and how far we should go when updating them. Right now I tried catching two paths when writing files plus the occ files:scan command, but obviously it may still happen that files are updated outside of Nextcloud. Should we try to recalculate the hash when reading files (download, generating previews, etc.)? BTW, what is chunked file upload in Nextcloud? In the code I see it depends on $_SERVER['HTTP_OC_CHUNKED'], but this variable is set only in tests, which to my untrained eye makes this code dead in any real-life scenario (and removable). Or am I wrong? Anyway, when uploading a 14 MB file it was sent in chunks of 10 MB + 4 MB, but it didn't use the chunking code in apps/dav/lib/Connector/Sabre/File.php
After further thought - no :-) As mentioned above, we will never get 100% accuracy.
Signed-off-by: Tomasz Grobelny <[email protected]>
The problem with SMB is that there could be multiple TB or tens of TB stored in the SMB storage, so calculating checksums for those files is not something we should trigger at all. Also, those files can change without Nextcloud getting any hint about it. So keeping checksums for storage types that can also be accessed without Nextcloud in between does not make any sense. This setting (whether the checksum is calculated or not) should therefore be on a per-storage basis, so that you can also say: yes, it is SMB, but it is used solely via Nextcloud and not accessed directly.
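A sketch of what such a per-storage switch could look like (the option name is invented for illustration; it is not an existing Nextcloud setting):

```php
<?php
// Hypothetical per-storage option: only compute checksums for storages that
// are used exclusively through Nextcloud, so multi-TB SMB mounts that are
// also accessed directly are never hashed.
function shouldComputeChecksums(array $storageConfig): bool {
    return (bool)($storageConfig['compute_checksums'] ?? false);
}

$smbViaNextcloudOnly     = ['type' => 'smb', 'compute_checksums' => true];
$smbAlsoAccessedDirectly = ['type' => 'smb', 'compute_checksums' => false];

var_dump(shouldComputeChecksums($smbViaNextcloudOnly));      // bool(true)
var_dump(shouldComputeChecksums($smbAlsoAccessedDirectly));  // bool(false)
```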
1. Allow the file's hash value to be empty.
This means the filecache table can grow significantly. I saw cases where, in an office, all users add the same external storage, used to store all office-wide shared files. This leads to all files being in the cache in duplicate, multiplied by the number of users. When one change is made, the database is updated accordingly on all those entries in the same way. Of course this is simply an inefficient way of sharing files throughout Nextcloud, but these are edge cases where the filecache table becomes very large. So what I want to say by this: if there is any chance, IMO this table should hold less information by splitting it, instead of adding more and increasing the mentioned "issues" by this.
@tomasz-grobelny could you rebase the branch?
Closing due to lack of activity and because this does not appear to be a patch we want to take with our file index.
Currently the filecache table contains a checksum column, but it is empty. This pull request updates the value in this column during file modification and when running occ files:scan.
Information about file checksums can be used e.g. for finding duplicate files.
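For instance, a rough illustration of the duplicate-finding use case (the database credentials and the default "oc_" table prefix are assumptions for the example): group filecache rows by checksum and list values that occur more than once.

```php
<?php
// Example only: find checksums shared by more than one filecache entry.
// Connection details and the "oc_" table prefix are assumed defaults.
$pdo = new PDO('mysql:host=localhost;dbname=nextcloud', 'nextcloud', 'secret');

$sql = "SELECT checksum, COUNT(*) AS copies
          FROM oc_filecache
         WHERE checksum IS NOT NULL AND checksum <> ''
      GROUP BY checksum
        HAVING COUNT(*) > 1";

foreach ($pdo->query($sql) as $row) {
    echo $row['checksum'] . ' appears ' . $row['copies'] . " times\n";
}
```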