- 2020 gives a slightly better deduplication ratio than 2016 on all chunk sizes.
- FastCDC produces fewer but bigger chunks. Ronomon/StadiaCDC chunks are closer to the avg.
- Increasing NC decreases chunk sizes and increases their count, and vice versa. However, the deduplication ratio is the best at the default NC2, even though NC3 emits more chunks.
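For context, NC here is FastCDC's normalization level: below the target average a stricter mask suppresses cut points, above it a looser mask encourages them, pulling chunk sizes toward the average. A minimal sketch of the idea, assuming a hypothetical random gear table (real implementations hardcode a fixed one):

```python
import random

# Hypothetical gear table; real FastCDC implementations ship a fixed one.
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def fastcdc_cut(data, min_size, avg_size, max_size, nc=2):
    """Return the length of the first chunk of `data`.

    Normalization level `nc` controls how far the two masks sit from the
    average: before `avg_size` a stricter mask (more bits) suppresses cuts,
    after it a looser mask encourages them. A higher `nc` means stronger
    normalization, hence more and smaller chunks.
    """
    bits = avg_size.bit_length() - 1        # log2 of the target average
    mask_strict = (1 << (bits + nc)) - 1
    mask_loose = (1 << (bits - nc)) - 1
    n = min(len(data), max_size)
    if n <= min_size:
        return n
    h = 0
    for i in range(min_size, n):
        h = ((h << 1) + GEAR[data[i]]) & MASK64
        mask = mask_strict if i < avg_size else mask_loose
        if h & mask == 0:
            return i + 1
    return n
```

The parameter names and mask placement (low bits) are illustrative; implementations differ in which bits the mask selects, but the two-mask switch at the average is the core of normalization.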
- Favors `0.5*avg/avg/<=2*avg` in terms of deduplication.
- Ronomon/ronomon64 produce chunks with sizes close to the average. There is a shift towards the max for the 2MB/4MB avg. The number of chunks is:
- 1.4k at 512KB
- 11.5k at 64KB
- NC2/3 give smaller chunks. The number of chunks:
- 1.8k and 2.2k at 512KB respectively
- 15k and 18k at 64KB respectively
- for comparison, RC1 (default) produces 23k chunks for 32KB avg.
The bigger number of chunks is likely caused by Ronomon using an adaptive threshold that switches to the less strict mask sooner.
- Deduplication ratio is almost the same for all the min/avg/max proportions, except `0.75*avg/avg/1.5*avg`. The ratio decreases from roughly 7.5% to 6.6% for these min/max chunk sizes.
- The 64-bit digest performs the same as the 32-bit one.
- The best deduplication ratio is for Buzhash32 with regression and `64`/`128`-byte windows, especially with chunk sizes `0.25*avg/avg/4*avg`. Admittedly, this deduplication ratio comes at the cost of having the biggest number of chunks.
- For Buzhash64, the `256`-byte window with regression shows the best deduplication.
- These groups of windows work similarly:
- 32 bit for 48/96 window
- 32 bit for 64/128 window
- 64 bit for 32/48/96/min_chunk window
- 64 bit for 64/128/256/512 window
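The window sizes above are the span of bytes the Buzhash rolling hash covers. A minimal sketch of a 32-bit Buzhash window, with a hypothetical random substitution table `T` (real chunkers hardcode a fixed table):

```python
import random

# Hypothetical 256-entry substitution table; real chunkers hardcode one.
random.seed(7)
T = [random.getrandbits(32) for _ in range(256)]

def rotl32(x, r):
    r %= 32
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def buzhash(window):
    """Hash a full window from scratch."""
    n = len(window)
    h = 0
    for i, b in enumerate(window):
        h ^= rotl32(T[b], n - 1 - i)
    return h

def roll(h, outgoing, incoming, n):
    """Slide an n-byte window one position in O(1)."""
    return rotl32(h, 1) ^ rotl32(T[outgoing], n) ^ T[incoming]
```

Rolling costs O(1) per byte regardless of the window size, which is why the larger windows recommended below add no per-byte cost.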
- `min_chunk` without regression gives the worst deduplication ratios. It is then preferable to avoid using `min_chunk` as a window size to skip bytes that would otherwise need to be hashed.
- Regression lets the algorithm produce more and smaller chunks with better deduplication.
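Regression can be sketched as a fallback search: if no position satisfies the full mask before `max_size`, cut at the latest position that satisfied a slightly weaker mask instead of cutting blindly at the maximum. The function below illustrates the idea (it is not the benchmark's exact implementation) and assumes per-position rolling fingerprints are already available:

```python
def cut_with_regression(fingerprints, avg_bits, max_size, levels=3):
    """Choose a cut point given per-position rolling fingerprints.

    The full condition is `fp & ((1 << avg_bits) - 1) == 0`. If nothing up
    to `max_size` satisfies it, regress to the latest position that
    satisfied the strongest weaker condition (one bit fewer, two bits
    fewer, ...) instead of forcing a cut at `max_size`.
    """
    best = {}                         # weaker level -> latest matching cut
    n = min(len(fingerprints), max_size)
    for i in range(n):
        fp = fingerprints[i]
        for b in range(avg_bits, avg_bits - levels, -1):
            if fp & ((1 << b) - 1) == 0:
                if b == avg_bits:
                    return i + 1      # full condition met: cut here
                best[b] = i + 1       # remember the partial match
                break
    for b in range(avg_bits - 1, avg_bits - levels, -1):
        if b in best:
            return best[b]            # regress to strongest partial match
    return n                          # forced cut at the maximum
```

Because a regressed cut still aligns with some content-defined boundary, it reshifts less data after insertions than a hard cut at `max_size`, which matches the observed better deduplication.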
- 64-bit hash gives up to 10% fewer chunks for the same window/chunk sizes. Deduplication is similar. The chunk sizes are closer to the desired average sizes with 64-bit hash.
- Even though the time measurements are not precise in this benchmark, it is clearly visible that the predicate Casync uses almost doubles the execution time.
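A likely reason: Casync tests the rolling hash with a modulo against a discriminator derived from the average chunk size, rather than with a bitwise mask, so every byte pays for an integer division. A sketch of the two predicate styles (the function names and signatures are illustrative, not Casync's API):

```python
def mask_predicate(h, avg_size):
    # Bitmask test: a single AND, but requires a power-of-two average.
    return h & (avg_size - 1) == 0

def modulo_predicate(h, discriminator):
    # Casync-style modulo test: costs an integer division per byte,
    # but allows arbitrary (non-power-of-two) average chunk sizes.
    return h % discriminator == discriminator - 1
```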
- The `0.25*avg/avg/4*avg` is the default chunk distribution used in Casync, and it indeed gives the best deduplication ratio across all chunk sizes.
- The default `0.5*avg/avg/8*avg` doesn't give the best deduplication even for the default `avg=1MB`.
- Chunk sizes are close to the configured average sizes, especially for `avg>=512KB` and `0.5*avg/avg/8*avg`.
- Small window sizes 32/48/64 skew the chunk sizes to the minimum.
- It is preferable to have bigger windows, for example `min_chunk` or `min_chunk/2`, to keep chunk sizes closer to the average.
- Masks with random bits produce a lot of small chunks without any visible impact on the deduplication.
- Deduplication deteriorates for chunk sizes >256KB.
- FastCDC-like normalization barely influences the algorithm's chunk counts or sizes.
- The deduplication is quite stable across different average chunk sizes. It doesn't decrease as fast as FastCDC's does.
- The window size `5` that was used in the paper is not appropriate for chunks bigger than `256KB`, because it produces split points too early. This leads to more and smaller chunks without improved deduplication ratios.
- Use window sizes of `avg_chunk / 1024` to emit chunks with sizes closer to the average chunk size.
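That heuristic can be stated as a one-liner; the 16-byte floor below is an assumption added for illustration, not a value from the benchmark:

```python
def window_for_avg(avg_chunk, floor=16):
    """Scale the rolling-hash window with the average chunk size,
    per the heuristic above. The `floor` of 16 bytes is a hypothetical
    lower bound so tiny averages still get a usable window."""
    return max(avg_chunk // 1024, floor)
```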