Deduplicate files at high speed! Written in RUST.
- Well, Rust.
- Input lines are streamed directly to the processing threads instead of being collected up front.
- Partitions the hash space to reduce lock contention.
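rustdedup's actual implementation isn't shown here, but the two ideas above can be sketched roughly like this. The shard count, thread count, and the `dedup` helper are illustrative assumptions, not the real API: lines are streamed to workers over a channel one at a time, and the seen-set is split into independently locked shards keyed by hash, so threads working on different shards never contend on the same mutex.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

const SHARDS: usize = 8; // illustrative partition count

fn shard_of(line: &str) -> usize {
    let mut h = DefaultHasher::new();
    line.hash(&mut h);
    (h.finish() as usize) % SHARDS
}

fn dedup(lines: impl Iterator<Item = String>) -> Vec<String> {
    // One lock per hash-space partition: threads touching different
    // shards never block each other.
    let shards: Arc<Vec<Mutex<HashSet<String>>>> =
        Arc::new((0..SHARDS).map(|_| Mutex::new(HashSet::new())).collect());

    let (tx, rx) = mpsc::channel::<String>();
    let rx = Arc::new(Mutex::new(rx));
    let out = Arc::new(Mutex::new(Vec::new()));

    let mut handles = Vec::new();
    for _ in 0..4 {
        let (rx, shards, out) = (Arc::clone(&rx), Arc::clone(&shards), Arc::clone(&out));
        handles.push(thread::spawn(move || loop {
            // Pull the next line as soon as it arrives -- no up-front collect.
            let line = match rx.lock().unwrap().recv() {
                Ok(l) => l,
                Err(_) => break, // channel closed, producer is done
            };
            let mut set = shards[shard_of(&line)].lock().unwrap();
            if set.insert(line.clone()) {
                out.lock().unwrap().push(line); // first sighting: keep it
            }
        }));
    }

    // Producer side: stream lines into the channel one at a time.
    for line in lines {
        tx.send(line).unwrap();
    }
    drop(tx); // close the channel so workers exit their loops

    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(out).unwrap().into_inner().unwrap()
}

fn main() {
    let input = vec!["a", "b", "a", "c", "b"].into_iter().map(String::from);
    let mut unique = dedup(input);
    unique.sort(); // worker interleaving makes output order nondeterministic
    assert_eq!(unique, vec!["a", "b", "c"]);
    println!("{:?}", unique);
}
```

Note the output order is nondeterministic because any worker may win the race for a given shard; a real tool that must preserve input order would need extra bookkeeping.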
In the test below we use a small 75 MB file (otherwise the hyperfine runs take too long) with 1,595,966 lines of data.
When we up the ante a little and move to a large 2.3 GB file, we see further improvements.
Compared with the likes of duplicut (https://github.com/nil0x42/duplicut), significant improvements can be seen; however, I'm not sure whether this boils down to the use of Rust over C.
Usage:

```sh
# stream from stdin
cat file.txt | rustdedup

# or read input files and write the deduplicated result to a file
rustdedup -i /diska9.txt extra.csv modded.csv -o output2.txt
```