Scale data centre based IPFS nodes #6556

Closed
ianopolous opened this issue Aug 3, 2019 · 23 comments

@ianopolous
Member

I'm trying to find an easy way to scale up our hosted ipfs instances in Peergos. Many hosting providers offer object storage, which is much cheaper than VM-attached storage. I'm aware of the in-progress S3 data store, but my understanding is that only one ipfs instance can safely use a given S3 store. This is because not all data in the data store is content addressed, and thus there is scope for conflict. The obvious example is the pin set, which is stored mutably under the key "/local/pins" (based on my reading of the code - correct me if I'm wrong).

One solution would be to use ipfs cluster, but that introduces unnecessary overhead and cost and doesn't currently fit our needs. Ideally I'd like all our ipfs instances to be able to store blocks in the same S3 store and use an actual database, say MySQL, for storing the pin set. This would allow the set of ipfs instances to logically act as one in terms of data stored and pin sets. The assumption here is that the data store has its own replication guarantees, so there's no need for duplicates.

My current reading of the code is that the pin set is hard-coded to use the datastore rather than hidden behind a pluggable interface.

Is this something that sounds interesting? @Stebalien @whyrusleeping

@ianopolous ianopolous added the kind/enhancement A net-new feature or improvement to an existing feature label Aug 3, 2019
@Stebalien
Member

The obvious example is the pin set, which is stored mutably under the key "/local/pins" (based on my reading of the code - correct me if I'm wrong).

We can currently use multiple datastores. You'd have to use the shared one for blocks and a non-shared one for everything else. The pin set is currently stored in the blockstore (as IPLD blocks, actually) and the CID of the current pin root is stored in a separate datastore location.

The tricky part is caching and GC:

  1. We'd have to add a way to configure IPFS to not cache blockstore misses.
  2. We'd have to hard-disable GC.

If you also need GC, this becomes a trickier problem.


We've also discussed using a database for metadata like pins. The sticking points in the past have been:

  1. The original dream was to make data storage self-hosting: all data would be stored within an IPLD data structure within the blockstore. However, that dream is still pretty far off, so I'm now all for ditching it until we can make something like that performant.
  2. SQLite requires CGO.
  3. Switching will be a large chunk of work.

However, even if we did switch, I'm not sure I'd want to support concurrently running multiple IPFS daemons against the same pinset. That introduces a whole new level of complexity into IPFS that I'd rather not have to deal with.

@ianopolous
Member Author

How about this for a simple proposal that solves most of the problems:

  1. Make all mutable data stored in the datastore have its path prefixed by the node ID. Then any number of ipfs nodes can trivially share the same datastore with no conflict (assuming GC is disabled). So, for example, the pin set root CID would be stored at "/$nodeid/local/pins".

N.B. You wouldn't need to worry about not caching blockstore misses, so long as you don't mind a little extra intra-data-centre bandwidth if a node is asked for a block it doesn't know it has.

The setup would then be N ipfs nodes, all using the same S3 datastore. And we've unlocked the ~10X cheaper storage.
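To illustrate the prefixing idea, here's a minimal sketch using go-datastore's namespace wrapper (assuming a recent, context-taking go-datastore; the peer ID and values are made up):

```go
package main

import (
	"context"
	"fmt"

	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/namespace"
	dssync "github.com/ipfs/go-datastore/sync"
)

func main() {
	ctx := context.Background()

	// Stand-in for the shared S3 datastore; any ds.Datastore would do here.
	shared := dssync.MutexWrap(ds.NewMapDatastore())

	// Each node wraps the shared store under its own peer ID, so mutable
	// keys such as the pin root cannot collide between nodes.
	nodeID := "QmNodeA" // illustrative peer ID
	scoped := namespace.Wrap(shared, ds.NewKey("/"+nodeID))

	// A write to "/local/pins" through the wrapper actually lands at
	// "/QmNodeA/local/pins" in the shared store.
	if err := scoped.Put(ctx, ds.NewKey("/local/pins"), []byte("bafy...pinroot")); err != nil {
		panic(err)
	}

	v, err := shared.Get(ctx, ds.NewKey("/"+nodeID+"/local/pins"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", v)
}
```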

This still leaves GC unsolved though. I don't think that can be solved without invoking something global like ipfs cluster (which would actually be logical because it already knows the global pinset). In our case we definitely need GC because encrypted data has zero duplication, and a 1 byte change in plaintext => between 4 KiB and 5 MiB of GC-able blocks.

Actually, here's a fun idea I just thought of. You could approximate a generational GC if you had two distinct datastores (e.g. two buckets in S3) and a way of telling all nodes to switch between them (copying only the things they are pinning); then you just clear the other datastore entirely.
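A rough sketch of that two-bucket rotation, assuming a recent context-taking go-ipfs-blockstore and a pinnedClosure helper that can enumerate every CID reachable from the pinned roots (not an existing go-ipfs API):

```go
package sketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// rotateGeneration copies the pinned closure from the live bucket into the
// standby bucket, wipes the live bucket, and returns the standby as the new
// live store. Both buckets are assumed to be (e.g. S3-backed) blockstores.
func rotateGeneration(ctx context.Context, live, standby blockstore.Blockstore,
	pinnedClosure func(context.Context) ([]cid.Cid, error)) (blockstore.Blockstore, error) {

	pinned, err := pinnedClosure(ctx)
	if err != nil {
		return nil, err
	}

	// Copy everything still pinned into the standby bucket.
	for _, c := range pinned {
		blk, err := live.Get(ctx, c)
		if err != nil {
			return nil, err
		}
		if err := standby.Put(ctx, blk); err != nil {
			return nil, err
		}
	}

	// All nodes switch to the standby bucket at this point; the old bucket
	// can then be cleared wholesale (a bucket purge would also work).
	keys, err := live.AllKeysChan(ctx)
	if err != nil {
		return nil, err
	}
	for c := range keys {
		if err := live.DeleteBlock(ctx, c); err != nil {
			return nil, err
		}
	}
	return standby, nil
}
```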

@Stebalien
Member

Make all mutable data stored in the datastore have its path prefixed by the node ID. Then any number of ipfs nodes can trivially share the same datastore with no conflict (assuming GC is disabled). So, for example, the pin set root CID would be stored at "/$nodeid/local/pins".

At the moment, we have (effectively) the inverse: all blocks are stored under /blocks. You can configure IPFS to use a separate blockstore for /blocks than for the rest of the datastore. We actually do this by default: /blocks uses flatfs while everything else uses leveldb.
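For reference, a minimal sketch of roughly how that mount layout is wired up with go-datastore's mount package (the repo paths and shard function here are illustrative, not exactly what go-ipfs does internally):

```go
package sketch

import (
	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/mount"
	flatfs "github.com/ipfs/go-ds-flatfs"
	leveldb "github.com/ipfs/go-ds-leveldb"
)

// buildRepoDatastore mounts two different stores under one namespace:
// blocks under /blocks, and everything else (pin root CID, keys, ...) at /.
func buildRepoDatastore() (ds.Datastore, error) {
	// Blocks go to flatfs here; in the shared setup this would be the S3
	// datastore plugin instead.
	blocks, err := flatfs.CreateOrOpen("/path/to/repo/blocks", flatfs.NextToLast(2), false)
	if err != nil {
		return nil, err
	}
	// The rest of the datastore goes to leveldb.
	rest, err := leveldb.NewDatastore("/path/to/repo/datastore", nil)
	if err != nil {
		return nil, err
	}

	return mount.New([]mount.Mount{
		{Prefix: ds.NewKey("/blocks"), Datastore: blocks},
		{Prefix: ds.NewKey("/"), Datastore: rest},
	}), nil
}
```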

N.B. You wouldn't need to worry about not caching blockstore misses, so long as you don't mind a little extra intra-data-centre bandwidth if a node is asked for a block it doesn't know it has.

The issue is that, as-is, we do cache misses. We'd just need to add a way to turn that off.

TL;DR: As far as I know, the only missing pieces here (assuming no GC) are:

  1. The ability to turn off caching misses.
  2. Clear instructions.

Given what you need, I'd consider taking all the pieces that make up go-ipfs and building a custom tool with two daemons:

  1. A coordinator that handles GC, pins, etc.
  2. "Servers" that all run a DHT client and a bitswap service.

The servers would coordinate with the GC service.

You could also pretty easily implement concurrent GC with some tricks:

  1. When pinning, record the pin before starting. This is what we call a best effort pin in go-ipfs. The downside is that GC could remove a grandchild of the pin if it's missing an intermediate node but that's a very unusual case (and we can just re-download it).
  2. When adding, create a session/transaction to keep any blocks read/written within the transaction from being GCed while the transaction is active.

Really, you could probably reuse 90% of the existing GC/pin logic.

@ianopolous
Member Author

I think I've convinced myself that we don't need anything extra apart from the S3 data store (and transactions mentioned below). This is great because we don't have the bandwidth to maintain a fork of IPFS or a distinct ipfs-datacentre project.

The two reasons for needing ipfs cluster for our use case were:

  1. Being able to pin a tree that won't fit on a single ipfs node.
  2. Enforcing a duplication/erasure coding policy for data persistence

Both of these go away with an unbounded datastore like S3.

The nice property of having a shared S3 data store would have been that other ipfs instances could bypass the DHT lookup and retrieve immediately from S3, with zero duplication of data. I think we can achieve this anyway by short-circuiting a block get before it even reaches IPFS if we know the "owner" of the block in Peergos parlance. Even if we don't do that, it just means that a get on the node would retrieve the block over the DHT and duplicate it in its own S3 store. But this will be cleaned up the next time that node GCs. So if some user's file went viral, then all our ipfs nodes would naturally end up caching it in the usual way until the load disappeared and each of them GC'd. This not only scales to handle load hitting our webservers, but also p2p demand from nodes elsewhere.

When adding, create a session/transaction to keep any blocks read/written within the transaction from being GCed while the transaction is active.

IPFS needs transactions/sessions to not lose data even with a single IPFS node:
#3544
We've already implemented that API on our side, and we just no-op it when calling ipfs until ipfs implements it as well.

@ianopolous ianopolous changed the title Extract interface for pinset store Scale data centre based IPFS nodes Aug 8, 2019
@obo20

obo20 commented Aug 8, 2019

@Stebalien When you state: “However, even if we did switch, I’m not sure I’d want to support concurrently running multiple IPFS daemons against the same pinset. That introduces a whole new level of complexity into IPFS that I’d rather not have to deal with.”

Is that purely from a GC / unpinning standpoint?

Or could you theoretically have multiple IPFS daemons using the same pinset if they were only adding?

@Stebalien
Member

@obo20

Both, for now. The assumption that the IPFS daemon owns its datastore is baked deeply into the application and sharing a datastore between multiple instances would require quite a bit of additional complexity. We'd need to handle things like distributed locking while updating the pinset.

Blocks are a special case because the same key always maps to the same value. That makes writes idempotent so we don't really need to take any locks.


On the other hand, I'd eventually like to extract all the blockstore related stuff into a separate "data" subsystem. When and if that happens (not for a while), that subsystem would be responsible for pins, data, and GC making it easy to replace the entire set wholesale.

@ianopolous
Member Author

@Stebalien Happy to close this now if you want?

@Stebalien
Member

We still need a feature to disable caching to make this feature work.

@ianopolous
Member Author

Nothing needs to change if each ipfs node uses its own dir in the s3 bucket.

Is there a possibility of including the s3 datastore in go-ipfs itself?

@Stebalien
Member

Nothing needs to change if each ipfs node uses its own dir in the s3 bucket.

Sure, but I thought you wanted to share blockstores, right? Ah, I see, you don't really care about that as you don't have much deduplication anyways.

Is there a possibility of including the s3 datastore in go-ipfs itself?

It needs to stay a plugin (it's massive) but I also need to fix plugin building.

@MichaelMure
Contributor

We still need a feature to disable caching to make this feature work.

Isn't that simply setting the bloom filter to zero?

@Stebalien
Member

It needs to stay a plugin (it's massive) but I also need to fix plugin building.

Specifically, ~6MiB (+15%). However, I'm going to try to make it easier to pull the plugin in at compile time.

Isn't that simply setting the bloom filter to zero?

We have two caches: A bloom filter and an LRU. We need to disable both.

@MichaelMure
Contributor

MichaelMure commented Aug 15, 2019

We have two caches: A bloom filter and an LRU. We need to disable both.

You are talking about the LRU cache in namesys, correct?

Edit: wait, no, this has nothing to do with the datastore. I'm confused now.

@Stebalien
Member

Ah, sorry, ARC, not LRU.

I'm talking about the github.com/ipfs/go-ipfs-blockstore.CachedBlockstore. If you turn the Bloom filter down to 0, you'll still get the ARC cache and there's currently no option to disable it.

@Stebalien
Member

Stebalien commented Aug 15, 2019

Take a look at Storage in core/node/groups.go. You can see how we configure the bloom filter size but not the ARC cache size (also a cache option).
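A rough sketch of what that boils down to at the go-ipfs-blockstore level (the wiring around it is simplified; only the cache options matter here):

```go
package sketch

import (
	"context"

	ds "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

func buildCachedBlockstore(ctx context.Context) (blockstore.Blockstore, error) {
	// Stand-in for the real repo datastore.
	base := blockstore.NewBlockstore(dssync.MutexWrap(ds.NewMapDatastore()))

	opts := blockstore.DefaultCacheOpts()
	// Setting the bloom filter size (and hash count) to 0 disables the
	// bloom filter layer...
	opts.HasBloomFilterSize = 0
	opts.HasBloomFilterHashes = 0
	// ...but the ARC layer is still on. At the library level,
	// opts.HasARCCacheSize = 0 would skip it too; the gap discussed above
	// is that go-ipfs's config only exposes the bloom filter size.

	return blockstore.CachedBlockstore(ctx, base, opts)
}
```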

@ianopolous
Member Author

Following on from this, we are now well and truly down the S3 blockstore route. We have our own implementation of GC that acts directly on the blockstore outside of ipfs (and we manage our own pinset). One thing that makes me nervous is that if ipfs isn't aware of any pins and ever tries to do a GC, it will delete everything. Is there a way to hard-disable GC?

@Stebalien
Member

GC won't happen if you haven't enabled it and you don't call ipfs repo gc. However, there's no "don't gc ever" flag. Want to add a config option (DisableGC)? When enabled, ipfs would refuse to garbage collect.

@acejam

acejam commented Jul 7, 2022

@ianopolous Any details on how Peergos handles GC directly on an S3-backed blockstore?

@ianopolous
Member Author

ianopolous commented Jul 8, 2022

@acejam We have our own fully concurrent GC implementation that can operate directly on the blockstore, or via the ipfs block API. This is enabled by our implementation of transactional block writes (where each write gets tagged with a transaction ID, and you close the transaction after committing the root to the pin set). With transactional writes the GC algorithm is very simple. Essentially:

  1. list the blocks in the blockstore
  2. list the pinned roots
  3. list the blocks in open transactions
  4. mark the blocks in the list from (1) that are reachable from the roots and the open transactions
  5. delete the unreachable blocks

None of that needs to hold any global locks, as long as there is a "happens before" between (1) and (2).
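For anyone curious, a minimal sketch of that mark-and-sweep in Go; the callbacks stand in for our own blockstore/pinset/transaction APIs and are not part of go-ipfs:

```go
package sketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
)

// gcSweep follows the five steps above. All five callbacks are assumed
// (hypothetical) hooks into the blockstore, pin set and transaction log.
func gcSweep(
	ctx context.Context,
	listBlocks func(context.Context) ([]cid.Cid, error), // (1) all blocks in the blockstore
	pinnedRoots func(context.Context) ([]cid.Cid, error), // (2) the pinned roots
	openTxBlocks func(context.Context) ([]cid.Cid, error), // (3) blocks in open transactions
	links func(context.Context, cid.Cid) ([]cid.Cid, error), // child links of a block
	deleteBlock func(context.Context, cid.Cid) error,
) error {
	// The block listing (1) is taken before the pin listing (2), matching
	// the "happens before" requirement: anything written after this
	// snapshot is protected by an open transaction, not by the snapshot.
	all, err := listBlocks(ctx)
	if err != nil {
		return err
	}
	roots, err := pinnedRoots(ctx)
	if err != nil {
		return err
	}
	open, err := openTxBlocks(ctx)
	if err != nil {
		return err
	}

	// (4) mark everything reachable from the roots and the open transactions.
	reachable := make(map[cid.Cid]bool)
	stack := append(append([]cid.Cid{}, roots...), open...)
	for len(stack) > 0 {
		c := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if reachable[c] {
			continue
		}
		reachable[c] = true
		children, err := links(ctx, c)
		if err != nil {
			continue // block may be absent locally; nothing to traverse
		}
		stack = append(stack, children...)
	}

	// (5) delete the unreachable blocks from the snapshot.
	for _, c := range all {
		if !reachable[c] {
			if err := deleteBlock(ctx, c); err != nil {
				return err
			}
		}
	}
	return nil
}
```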

@bjornleffler

@ianopolous I'm curious how much a consistent object store backend would solve your problems. It seems like you're trying to work around some limitations with S3. I'm building a new IPFS datastore for Google Cloud Storage. GCS is both strongly consistent and globally accessible. Would some problems go away if all IPFS nodes had a consistent view of the backend datastore?

As far as I can tell, there is no pin logic in github.com/ipfs/go-datastore, so it's unclear to me how this could be solved in the datastore layer. I haven't looked at the pinning code nor GC code. What am I missing? Why is this a hard problem?

@ianopolous
Member Author

@bjornleffler The general problem (which isn't a problem for us any more) is that if you have multiple ipfs nodes pointing to the same blockstore (nothing to do with S3 specifically) and one does a GC concurrently with another writing blocks, then those blocks may be GC'd even if they would eventually be pinned. Kubo gets around this with a global lock and by assuming no other ipfs instance shares the same blockstore.

@bjornleffler

Thank you for clarifying. Why isn't this a problem anymore?

@ianopolous
Member Author

We have our own concurrent GC implementation external to IPFS.
