Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add buffersize argument to rdfind #178

Closed
wants to merge 3 commits into from

Conversation

trollkarlen
Copy link
Contributor

add the argument to change buffer size of checksum calculation, may increase speed on some checksum algorithms and disk types.

Copy link
Owner

@pauldreik pauldreik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stretch goal:

I am thinking of what the optimal buffersize is. It has been 19 years since I wrote rdfind, and hardware has changed quite a bit since then.
It would be good to have data on

  • a fast typical desktop (amd64 cpu, fast ssd)
  • a smaller system (raspberry pi3, sd card and 32 bit?)
  • a server (pretty similar to the desktop I assume, but storage is remote)

also, what is good for small files and what is good for large files?

Fileinfo.cc Outdated
@@ -80,7 +81,7 @@ Fileinfo::fillwithbytes(enum readtobuffermode filltype,
if (checksumtype != Checksum::checksumtypes::NOTSET) {
Checksum chk(checksumtype);

char buffer[4096];
char buffer[buffersize];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a variable length array - not standard C++.

I think it makes sense to only allocate it once. This is C++17, so we can't use std::span.

Please make a std::vector outside the loop, sized to buffersize and pass it by reference into this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like having this on the stack because of ease and also if this becomes multi-threaded sometime somewhere in the future.
I atleast changed to std::vector, but c++ is not my stong side so please review carefully :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its kind of weird that the variable length array passes all the do_quality_checks.sh when its not in the c++ standard :)

rdfind.cc Outdated Show resolved Hide resolved
rdfind.cc Outdated Show resolved Hide resolved
@pauldreik
Copy link
Owner

pauldreik commented Jan 22, 2025

also, the title of the MR and the commit do not need to say it is rdfind, that is pretty implicit 😃

@@ -65,6 +65,10 @@ usage()
"device and inode\n"
<< " -checksum md5 |(sha1)| sha256 | sha512\n"
<< " checksum type\n"
<< " -buffersize N (N=4096,0<N<1000KB)\n"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Form my understanding kB means 1000Bytes and KB means 1024Bytes, and in this case Im referring to 1024Bytes.
But i guess just saying this can restart the vim/emacs war again :)

@pauldreik
Copy link
Owner

forgot: please add a test

@trollkarlen
Copy link
Contributor Author

Most should be fixed

@trollkarlen
Copy link
Contributor Author

had some error in sh script.
Added shellsheck workflow and script, just added the new files.
Too many issues in the other files to fix here.
Maybe will do PR later to fix this.

add the argument to change buffer size of checksum calculation,
may increase speed on some checksum algorithms and disk types.

Signed-off-by: Robert Marklund <[email protected]>
Signed-off-by: Robert Marklund <[email protected]>
Signed-off-by: Robert Marklund <[email protected]>
@trollkarlen trollkarlen force-pushed the buffersize-main branch 2 times, most recently from d91b21c to c0e9606 Compare January 23, 2025 17:31
@trollkarlen
Copy link
Contributor Author

I removed "add shellcheck script and workflow", from this patchset and will move that to new PR that fixes all issues aswell

@trollkarlen
Copy link
Contributor Author

trollkarlen commented Jan 24, 2025

Please review again to see if the changes are sufficient.

@pauldreik
Copy link
Owner

pauldreik commented Jan 26, 2025

I am sorry - I had made review comments but forgot to actually send them! On top of that, I missed I had to approve the CI runs.

I worked a bit on this to get this over the finish line.

Most of the changes are things you could not reasonably know. Such things take long time to discuss over the internet, had it been at work we would just had a discussion and get it done quickly.

The differences:

  1. set the default to 1 MiB (after measuring on a small device, after this it got slower. on a big device, there were improvements but very small)
  2. fix reallocating the vector for each file
  3. fix spelling and wording in the documentation
  4. up the limit to 128 MiB, it is heap allocated so no problem with big values.
  5. modified the speed test a bit to include large and small files and make the output more readable
  6. moved the size member in the struct to avoid padding
  7. removed the defaults from fillwithbytes, that was historical garbage
  8. got rid of an include file (old sins)

I hope you are ok with the changes, I made them as fixups to your commits so you are still the author. Please let me know if you want something else!

see the branch here: #180

@pauldreik
Copy link
Owner

closing, in favour of #180

@pauldreik pauldreik closed this Feb 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants