Scale data centre based IPFS nodes #6556
We can currently use multiple datastores. You'd have to use the shared one for blocks and a non-shared one for everything else. The pin set is currently stored in the blockstore (as IPLD blocks, actually) and the CID of the current pin root is stored in a separate datastore location. The tricky part is caching and GC:
If you also need GC, this becomes a trickier problem. We've also discussed using a database for metadata like pins. The sticking points in the past have been:
However, even if we did switch, I'm not sure I'd want to support concurrently running multiple IPFS daemons against the same pinset. That introduces a whole new level of complexity into IPFS that I'd rather not have to deal with.
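For concreteness, the split described above (a shared datastore for blocks, a private one for everything else, including the pin root) maps onto the mount section of the go-ipfs Datastore.Spec. Below is a rough sketch using the go-ds-s3 plugin; the bucket name, region, and credentials are placeholders, and the exact s3ds fields should be checked against the plugin's README:

```json
{
  "Datastore": {
    "Spec": {
      "type": "mount",
      "mounts": [
        {
          "mountpoint": "/blocks",
          "type": "measure",
          "prefix": "s3.datastore",
          "child": {
            "type": "s3ds",
            "region": "us-east-1",
            "bucket": "shared-ipfs-blocks",
            "accessKey": "",
            "secretKey": ""
          }
        },
        {
          "mountpoint": "/",
          "type": "measure",
          "prefix": "leveldb.datastore",
          "child": {
            "type": "levelds",
            "path": "datastore",
            "compression": "none"
          }
        }
      ]
    }
  }
}
```

Only /blocks points at the shared bucket; the pin root and other mutable keys stay in each node's local leveldb.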
How about this for a simple proposal that solves most of the problems.
N.B. You wouldn't need to worry about not caching block store misses, so long as you don't mind a little extra bandwidth intra data centre if a node is asked for a block it doesn't know it has. The setup would then be N ipfs nodes, all using the same S3 datastore, and we've unlocked the ~10X cheaper storage.

This still leaves GC unsolved though. I don't think that can be solved without invoking something global like ipfs cluster (which would actually be logical because it already knows the global pinset). In our case we definitely need GC because encrypted data has zero duplication, and a 1 byte change in plaintext => between 4 KiB and 5 MiB of GC-able blocks.

Actually, here's a fun idea I just thought of. You could approximate a generational GC if you had 2 distinct datastores (e.g. 2 buckets in S3) and a way of telling all nodes to switch between them (copying only the things they are pinning), then just clear the other datastore entirely.
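A minimal sketch of that two-bucket swap in terms of the go-datastore interfaces is below. The pinned-keys list and the coordination that tells every node to switch are assumed to exist already, and a real version would need all nodes to stop writing to the old generation before anyone wipes it:

```go
package gcswap

import (
	"context"

	ds "github.com/ipfs/go-datastore"
	"github.com/ipfs/go-datastore/query"
)

// copyPinned copies everything this node still pins from the old generation
// into the fresh one. Every node runs this before anyone starts wiping.
func copyPinned(ctx context.Context, old, fresh ds.Datastore, pinned []ds.Key) error {
	for _, k := range pinned {
		val, err := old.Get(ctx, k)
		if err != nil {
			return err
		}
		if err := fresh.Put(ctx, k, val); err != nil {
			return err
		}
	}
	return nil
}

// wipeOldGeneration clears the old generation wholesale once every node has
// finished copyPinned. With S3 this could instead be a bulk delete or a
// bucket lifecycle rule rather than key-by-key deletes.
func wipeOldGeneration(ctx context.Context, old ds.Datastore) error {
	res, err := old.Query(ctx, query.Query{KeysOnly: true})
	if err != nil {
		return err
	}
	defer res.Close()
	for r := range res.Next() {
		if r.Error != nil {
			return r.Error
		}
		if err := old.Delete(ctx, ds.NewKey(r.Key)); err != nil {
			return err
		}
	}
	return nil
}
```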
At the moment, we have (effectively) the inverse: all blocks are stored under
The issue is that, as-is, we do cache misses. We'd just need to add a way to turn that off. TL;DR: As far as I know, the only missing pieces here (assuming no GC) are:
Given what you need, I'd consider taking all the pieces that make up go-ipfs and building a custom tool with two daemons:
The servers would coordinate with the GC service. You could also pretty easily implement concurrent GC with some tricks:
Really, you could probably reuse 90% of the existing GC/pin logic.
I think I've convinced myself that we don't need anything extra apart from the S3 data store (and transactions mentioned below). This is great because we don't have the bandwidth to maintain a fork of IPFS or a distinct ipfs-datacentre project. The two reasons for needing ipfs cluster for our use case were:
Both of these go away with an unbounded datastore like S3. The nice property of having a shared S3 data store would have been that other ipfs instances could bypass the DHT lookup and retrieve immediately from S3, with zero duplication of data. I think we can achieve this anyway by short-circuiting a block get before it even gets to IPFS if we know the "owner" of the block in Peergos parlance. Even if we don't do that, it just means that a get on the node would retrieve the block over the DHT and duplicate it in its own S3 store. But this will be cleaned up the next time this node GCs. So if a file of some user went viral, then all our ipfs nodes would naturally end up caching it in the usual way until the load disappeared and each of them GC'd. This not only scales to handle load that is hitting our webservers, but also p2p demand from nodes elsewhere.
IPFS needs transactions/sessions to not lose data even with a single IPFS node:
@Stebalien When you state: “However, even if we did switch, I’m not sure I’d want to support concurrently running multiple IPFS daemons against the same pinset. That introduces a whole new level of complexity into IPFS that I’d rather not have to deal with.” is that purely from a GC / unpinning standpoint? Or could you theoretically have multiple IPFS daemons using the same pinset if they were only adding
Both, for now. The assumption that the IPFS daemon owns its datastore is baked deeply into the application, and sharing a datastore between multiple instances would require quite a bit of additional complexity. We'd need to handle things like distributed locking while updating the pinset. Blocks are a special case because the same key always maps to the same value. That makes writes idempotent, so we don't really need to take any locks. On the other hand, I'd eventually like to extract all the blockstore-related stuff into a separate "data" subsystem. When and if that happens (not for a while), that subsystem would be responsible for pins, data, and GC, making it easy to replace the entire set wholesale.
@Stebalien Happy to close this now if you want?
We still need a way to disable caching to make this work.
Nothing needs to change if each ipfs node uses its own dir in the s3 bucket. Is there a possibility of including the s3 datastore in go-ipfs itself?
Sure, but I thought you wanted to share blockstores, right? Ah, I see, you don't really care about that as you don't have much deduplication anyways.
It needs to stay a plugin (it's massive) but I also need to fix plugin building.
Isn't that simply setting the bloom filter to zero?
Specifically, ~6MiB (+15%). However, I'm going to try to make it easier to pull the plugin in at compile time.
We have two caches: a bloom filter and an LRU. We need to disable both.
You are talking about the LRU cache in namesys, correct? Edit: wait, no, this has nothing to do with the datastore. I'm confused now.
Ah, sorry, ARC, not LRU. I'm talking about the
Take a look at
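For anyone following along: in the go-ipfs-blockstore package, both caches (the bloom filter and the ARC "has" cache) are configured through CacheOpts when the cached blockstore is constructed. A sketch of building a blockstore with both disabled is below; the regular config only exposes the bloom filter half (Datastore.BloomFilterSize), so zeroing the ARC cache this way assumes a custom build rather than a stock daemon:

```go
package blockutil

import (
	"context"

	ds "github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	blockstore "github.com/ipfs/go-ipfs-blockstore"
)

// uncachedBlockstore wraps a backing datastore (e.g. the shared S3 one) in a
// blockstore with both the bloom filter and the ARC cache turned off, so
// every Has/Get goes straight through to the backing store and misses are
// never cached locally.
func uncachedBlockstore(ctx context.Context, backing ds.Batching) (blockstore.Blockstore, error) {
	bs := blockstore.NewBlockstore(dssync.MutexWrap(backing))

	opts := blockstore.DefaultCacheOpts()
	opts.HasBloomFilterSize = 0 // 0 disables the bloom filter entirely
	opts.HasARCCacheSize = 0    // 0 disables the ARC has/size cache

	return blockstore.CachedBlockstore(ctx, bs, opts)
}
```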
Following on from this: we are now well and truly down the S3 blockstore route. We have our own implementation of GC that acts directly on the blockstore outside of ipfs (and we manage our own pinset). One thing that makes me nervous is that if ipfs isn't aware of any pins and ever tries to do a GC, it will delete everything. Is there a way to hard-disable GC?
GC won't happen if you haven't enabled it and you don't call
@ianopolous Any details on how Peergos handles GC directly on an S3-backed blockstore?
@acejam We have our own fully concurrent GC implementation that can operate directly on the blockstore, or via the ipfs block API. This is enabled by our implementation of transactional block writes (where each write gets tagged with a transaction id, and you close the transaction after committing the root to the pin set). With transactional writes the GC algorithm is very simple. Essentially:
None of that needs to hold any global locks, as long as there is a "happens before" between (1) and (2).
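To make the shape of such an algorithm concrete (this is an illustration of the transactional-write idea, not the actual Peergos code, and every type here — BlockStore, PinSet, TxLog — is hypothetical): any block written under a still-open transaction is treated as live, so the sweep can run concurrently with writers without global locks.

```go
package gc

import "context"

// Cid stands in for a content identifier; a real implementation would use a
// proper CID type.
type Cid = string

// Hypothetical interfaces for the pieces the collector needs.
type BlockStore interface {
	AllKeys(ctx context.Context) []Cid
	Delete(ctx context.Context, c Cid) error
}

type PinSet interface {
	Roots(ctx context.Context) []Cid
}

type TxLog interface {
	// OpenBlocks lists every block written under a transaction that has not
	// been closed yet (i.e. whose root is not yet committed to the pin set).
	OpenBlocks(ctx context.Context) []Cid
}

// ConcurrentGC keeps everything reachable from pinned roots plus everything
// protected by an open transaction, and deletes the rest. New writes that
// race with the sweep are safe because they sit inside open transactions.
func ConcurrentGC(ctx context.Context, blocks BlockStore, pins PinSet, txs TxLog) error {
	keep, err := reachable(ctx, blocks, pins.Roots(ctx))
	if err != nil {
		return err
	}
	for _, c := range txs.OpenBlocks(ctx) {
		keep[c] = struct{}{}
	}
	for _, c := range blocks.AllKeys(ctx) {
		if _, ok := keep[c]; !ok {
			if err := blocks.Delete(ctx, c); err != nil {
				return err
			}
		}
	}
	return nil
}

// reachable would walk the DAG from the pinned roots; the traversal is
// stubbed out here and only keeps the roots themselves.
func reachable(ctx context.Context, bs BlockStore, roots []Cid) (map[Cid]struct{}, error) {
	keep := make(map[Cid]struct{}, len(roots))
	for _, r := range roots {
		keep[r] = struct{}{}
	}
	return keep, nil
}
```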
@ianopolous I'm curious how much a consistent object store backend would solve your problems. It seems like you're trying to work around some limitations of S3. I'm building a new IPFS datastore for Google Cloud Storage. GCS is both strongly consistent and globally accessible. Would some problems go away if all IPFS nodes had a consistent view of the backend datastore? As far as I can tell, there is no pin logic in github.com/ipfs/go-datastore, so it's unclear to me how this could be solved in the datastore layer. I haven't looked at the pinning code or the GC code. What am I missing? Why is this a hard problem?
@bjornleffler The general problem (which isn't a problem for us any more) is that if you have multiple ipfs nodes pointing to the same blockstore (nothing to do with S3 specifically), and one does a GC concurrently with another writing blocks, then those blocks may be GC'd even though they would eventually be pinned. Kubo gets around this with a global lock and by assuming no other ipfs instance shares the same blockstore.
Thank you for clarifying. Why isn't this a problem anymore?
We have our own concurrent GC implementation external to IPFS.
I'm trying to find an easy way to scale up our hosted ipfs instances in Peergos. Many hosting providers offer object storage, which is much cheaper than a VM's attached storage. I'm aware of the in-progress S3 data store, but my understanding is that only one ipfs instance will be able to use a given S3 store. This is because not all data in the data store is content-addressed, and thus there is scope for conflict. The obvious example is the pin set, which is stored mutably under the key "/local/pins" (based on my reading of the code - correct me if I'm wrong).
One solution would be to use ipfs cluster, but that introduces unnecessary overhead and cost and doesn't currently fit our needs. Ideally I'd like all our ipfs instances to be able to store blocks in the same S3 bucket and use an actual database like, say, MySQL, for storing the pinset. This would allow the set of ipfs instances to logically act as one in terms of data stored and pin sets. The assumption here is that the data store has its own replication guarantees, so there's no need for duplicates.
My current reading of the code is that the pin set is hard-coded to use the datastore rather than a pluggable interface.
Is this something that sounds interesting? @Stebalien @whyrusleeping
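To illustrate the kind of thing meant by "an actual database for the pinset", here is a rough sketch of a shared pin table; the schema, driver choice, and method names are all hypothetical:

```go
package pinsetdb

import (
	"context"
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // illustrative driver choice
)

// SQLPinSet stores one row per pinned root CID in a database shared by every
// ipfs instance, instead of the per-node "/local/pins" datastore key.
type SQLPinSet struct {
	db *sql.DB
}

func Open(dsn string) (*SQLPinSet, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	// Minimal schema; a real version would also track pin type
	// (direct/recursive) and probably which node created the pin.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS pins (cid VARCHAR(128) PRIMARY KEY)`); err != nil {
		return nil, err
	}
	return &SQLPinSet{db: db}, nil
}

// Pin is idempotent: inserting an already-pinned CID is a no-op.
func (p *SQLPinSet) Pin(ctx context.Context, cid string) error {
	_, err := p.db.ExecContext(ctx, `INSERT IGNORE INTO pins (cid) VALUES (?)`, cid)
	return err
}

func (p *SQLPinSet) IsPinned(ctx context.Context, cid string) (bool, error) {
	var found string
	err := p.db.QueryRowContext(ctx, `SELECT cid FROM pins WHERE cid = ?`, cid).Scan(&found)
	if err == sql.ErrNoRows {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```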