
[RFC/Braindump] Change how blocks are stored to allow for filesystem features (reflink, dedup, etc...) #10016

Closed
markg85 opened this issue Jul 12, 2023 · 2 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature need/triage Needs initial labeling and prioritization

Comments

@markg85
Contributor

markg85 commented Jul 12, 2023

Checklist

  • My issue is specific & actionable.
  • I am not suggesting a protocol enhancement.
  • I have searched on the issue tracker for my issue.

Description

Hi,

#8198 is very much related to this, though it relies on OS features and on a blockstore implementing them. My proposal takes the opposite angle. Both would achieve the same goal (data deduplication).

Currently, adding a file to IPFS chunks it and adds the individual blocks to the data folder. The filestore datastore partly does what's proposed here, though its design - not storing the file itself - runs counter to this proposal. A version that does what's proposed here would be the filestore with the added feature of saving the full data too. While very redundant at first glance, this allows filesystem features to kick in and resolve that redundancy.

Some command outputs to clarify the intent. Let's say I have this big file:

❯ ls -l
total 14706932
-rw-r--r-- 1 mark mark 15059891792 Jul 12 16:19 big_file

The file, in this case, has the following blake3 hash. This matters for the example later on, when the file is in the ipfs data folder.

❯ b3sum big_file
b792e5e711008fcf90548189bbe4771aea9c1a9751066c866e05637124791ade  big_file

Now if we add this file with ipfs add big_file we get (side note: the hash function doesn't matter, I'm just using blake3 because I like it):

❯ ipfs add big_file -n --hash=blake3
added bafyb4idj7r6ilfrlve7hwgx2kqgoquh6snhgkr2srpmlj32zgx6jwy5kci big_file
 14.03 GiB / 14.03 GiB [==============================] 100.00%

Within the ipfs data folder we should now have the file bafyb4idj7r6ilfrlve7hwgx2kqgoquh6snhgkr2srpmlj32zgx6jwy5kci, but with the blake3 hash b792e5e711008fcf90548189bbe4771aea9c1a9751066c866e05637124791ade, thus matching the source file: it's a copy.

❯ ls -l
total 14706932
-rw-r--r-- 1 mark mark 15059891792 Jul 12 16:19 bafyb4idj7r6ilfrlve7hwgx2kqgoquh6snhgkr2srpmlj32zgx6jwy5kci
❯ b3sum bafyb4idj7r6ilfrlve7hwgx2kqgoquh6snhgkr2srpmlj32zgx6jwy5kci
b792e5e711008fcf90548189bbe4771aea9c1a9751066c866e05637124791ade  bafyb4idj7r6ilfrlve7hwgx2kqgoquh6snhgkr2srpmlj32zgx6jwy5kci

All of the above is just using commands to visualize the intended output. None of this exists in IPFS yet. But if it were to exist then this would give us all the features filesystems have to offer!

  • Data deduplication? We would get that for free, depending on the filesystem used. If one were to use zfs/btrfs/bcachefs then you'd already have deduplication right there!
  • reflinks/hardlinks/symbolic links would all work right out of the box, depending on the filesystem.
  • Importing data into ipfs would be as simple as "copying" it to the ipfs data folder (or using one of the linking methods). Same for exporting: just copy.
  • and many more smaller advantages
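To illustrate the "linking methods" import idea, here is a minimal Python sketch. The blockstore path layout and the reuse of the CID as a filename are assumptions made purely for illustration, not Kubo's actual behavior: it hardlinks a source file into a hypothetical block directory under its CID, so no bytes are duplicated.

```python
import os
import tempfile

def import_by_hardlink(src_path, store_dir, cid):
    """Hypothetical import: hardlink the source file into the block
    directory under its CID instead of copying the bytes."""
    dst = os.path.join(store_dir, cid)
    os.link(src_path, dst)  # both names now share one inode: zero extra data
    return dst

# Throwaway demo setup.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "big_file")
with open(src, "wb") as f:
    f.write(b"example data" * 100)

store = os.path.join(tmp, "blocks")
os.mkdir(store)

cid = "bafyb4idj7r6ilfrlve7hwgx2kqgoquh6snhgkr2srpmlj32zgx6jwy5kci"
dst = import_by_hardlink(src, store, cid)

# Same inode, same content: the "import" stored nothing twice.
same_inode = os.stat(src).st_ino == os.stat(dst).st_ino
```

A reflink copy (`cp --reflink` on btrfs/XFS) would behave similarly from the blockstore's point of view, while keeping the two names independent on write.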

One thing that makes this slightly complicated is the missing metadata: you need to know each chunk, its offset, and its CID. The ipfs add --nocopy command probably does something like this already (I haven't looked at its code), so technically this shouldn't be that difficult to solve. I can see two possible approaches that seem clean to me.

Option 1: a file per chunk
Adding a file under this proposed mechanism would create a folder named in this structure:
<cid>_<num_of_blocks>_metadata (so: bafyb4idj7r6ilfrlve7hwgx2kqgoquh6snhgkr2srpmlj32zgx6jwy5kci_43580_metadata)
The num_of_blocks in the name tells kubo how many blocks the file has. If a count of all files within that folder equals the number in the folder name (43580 in this case), kubo can assume the whole file is known to it.
The contents of this folder would contain the individual chunks in the following format (note only filenames, no file content):
<cid>_<chunk_begin_in_bytes>_<chunk_end_in_bytes>
Which would make it look like:

bafyA...._0_1023
bafyB...._1024_2047
bafyC...._2048_3071
...

This structure lets the IPFS instance quickly determine which chunks of the file it knows: a simple file-existence check for a specific chunk suffices.
The hash of each chunk is part of its CID, so a specific chunk can be verified too if needed.
Knowing which chunks are missing is slightly more complicated in this option, as one would need to parse all the filenames to build up a structure that reveals the gaps.
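A sketch of how Option 1 could be queried, assuming the folder and filename formats above (the short CIDs are placeholders). The completeness check is exactly the weak count-based check described:

```python
import os
import tempfile

def scan_option1(meta_dir):
    """Scan an Option-1 folder named <cid>_<num_of_blocks>_metadata whose
    entries are empty files named <cid>_<begin>_<end>."""
    expected = int(os.path.basename(meta_dir).split("_")[-2])
    chunks = []
    for name in os.listdir(meta_dir):
        cid, begin, end = name.rsplit("_", 2)  # CIDs contain no underscores
        chunks.append((int(begin), int(end), cid))
    chunks.sort()
    return expected, chunks

def is_complete(meta_dir):
    expected, chunks = scan_option1(meta_dir)
    return len(chunks) == expected  # the weak count-based check

# Demo: a folder claiming 3 blocks but holding only 2 (placeholder CIDs).
root = tempfile.mkdtemp()
meta = os.path.join(root, "bafyexample_3_metadata")
os.mkdir(meta)
for name in ("bafyA_0_1023", "bafyC_2048_3071"):
    open(os.path.join(meta, name), "w").close()

expected, chunks = scan_option1(meta)
```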

Option 2: Single metadata file
Alternatively, a single metadata file with its name in this format:
<cid>_<num_of_blocks>_metadata
The name logic is the same as option 1 above, just as a file instead of a folder.

The file content would describe each block, one line per block, in this format:
<index>,<cid>,<chunk_begin_in_bytes>,<chunk_end_in_bytes>

Which would look somewhat like this:

0,bafyA....,0,1023
1,bafyB....,1024,2047
2,bafyC....,2048,3071
...

The index here defines the block's position. This also means the file doesn't have to be sorted; any order is perfectly valid:

2,bafyC....,2048,3071
0,bafyA....,0,1023
1,bafyB....,1024,2047

This is done to make appending to the file possible while downloading the source file: a block that arrives out of order can simply be appended to the metadata file in this design.

Whether the file is complete can be determined by counting the lines in this metadata file. If that count matches the number in the metadata filename (43580 in this case), you can assume the whole file and all its blocks are known to Kubo.
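Option 2 could be parsed along these lines (a sketch; the single-file format and the placeholder CIDs follow the description above). Out-of-order lines are handled naturally, and missing indices fall out of the same pass:

```python
import os
import tempfile

def parse_option2(path):
    """Parse an Option-2 file named <cid>_<num_of_blocks>_metadata with one
    '<index>,<cid>,<begin>,<end>' line per block, in any order."""
    expected = int(os.path.basename(path).split("_")[-2])
    blocks = {}
    with open(path) as f:
        for line in f:
            idx, cid, begin, end = line.strip().split(",")
            blocks[int(idx)] = (cid, int(begin), int(end))
    missing = sorted(set(range(expected)) - set(blocks))
    return len(blocks) == expected, missing, blocks

# Demo: two blocks arrived out of order, the middle one is still missing.
path = os.path.join(tempfile.mkdtemp(), "bafyexample_3_metadata")
with open(path, "w") as f:
    f.write("2,bafyC,2048,3071\n0,bafyA,0,1023\n")

complete, missing, _ = parse_option2(path)   # missing == [1]

# A late block is simply appended, as in the proposal.
with open(path, "a") as f:
    f.write("1,bafyB,1024,2047\n")

complete_now, missing_now, _ = parse_option2(path)
```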

Is the whole file known?
As a side note to both options: what I'm describing here are weak, count-based checks to determine whether a file is complete. Those checks could be stronger; blake3 streaming verification might be interesting here.

Both options are lightweight
The metadata is the only data added to make a file work for IPFS, and its size is on the order of kilobytes. The metadata for a single block would be:
bafyb4idj7r6ilfrlve7hwgx2kqgoquh6snhgkr2srpmlj32zgx6jwy5kci_0000000000000_0000000000000
Say that's - rounded - 90 bytes, and that's being generous, with offsets in the range of terabytes and the lengthy bafy CID. If a file had 10000 such blocks, its metadata overhead would be (90 * 10000) ~879 kilobytes. With the default chunker (256 KiB blocks), you'd reach ~10000 blocks with a file of ~2.5 GB.
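Checking the arithmetic (same numbers as above; the 90-byte entry and the 256 KiB default chunk size are the estimates from the text):

```python
bytes_per_entry = 90            # generous: long bafy CID + two wide offsets
blocks = 10_000

overhead_kib = bytes_per_entry * blocks / 1024      # ~879 KiB of metadata
file_gib = blocks * 256 * 1024 / 1024**3            # ~2.44 GiB of file data
```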

Opinions?
Let me know what you think of this idea!
Are there better ways perhaps?

@markg85 markg85 added the kind/enhancement A net-new feature or improvement to an existing feature label Jul 12, 2023
@lidel lidel added the need/triage Needs initial labeling and prioritization label Aug 17, 2023
@aschmahmann
Contributor

A version that does what's proposed here would be the filestore with the added feature of saving the full data too

This seems like a duplicate of #3981. Post if you disagree and this can be reopened.

@markg85
Contributor Author

markg85 commented Aug 21, 2023

True.
That issue triggered me to write this RFC.

If that issue is clear enough (it is to me), then tracking its progress and following up there makes sense.
If it's not clear, or if an RFC needs to be made, then it might be better to close that issue and move the discussion here. Both are fine by me.

Now it's up to you again :)
