
Coretime Scheduling Regions #3

Closed

Contributor

@rphmeier rphmeier commented Jul 2, 2023

This introduces a new scheduling primitive for the relay-chain known as Coretime Scheduling. It is intended to be a companion to paritytech/polkadot#1 but can function with any higher-level logic for allocating cores.

This is intended to be a general solution designed with some of the following features in mind:

  • on-demand coretime
  • elastic scaling
  • scheduling multiple chains to share a core with minimal friction
  • posting many small blocks to a core every relay-chain block

There may be other use-cases we don't know of yet, but I'm fairly confident that this solution is general enough to accommodate pretty much anything we throw at it.

@gavofyork
Contributor

Would it be reasonable to describe this as taking a Parachain-oriented, rather than Core-oriented, approach? I.e. in that it doesn't describe what each core should be doing, but rather which parachains should be validated and at what frequency divisor.

Is this a necessary condition of the current implementation?

@rphmeier
Contributor Author

rphmeier commented Jul 2, 2023

Would it be reasonable to describe this as taking a Parachain-oriented, rather than Core-oriented, approach?

The main reason for regions being anchored to specific cores is so validators assigned to a core have an indication of which parachain candidates they should or shouldn't validate. If regions didn't hold any core information, we'd either have to generate the mapping between regions and cores in the runtime (requiring low runtime scheduling overhead is Goal (1)) or validators could potentially work on any and all candidates, and wouldn't have much idea of what their group-mates were doing.

In the latter case, we'd also need all validators to coordinate off-chain about which parachains which groups intend to work on. I'm not aware of any low-overhead solutions for doing that.

I wouldn't rule out the possibility of removing core indices from regions, but it'd require fairly significant rewrites of the networking protocols. Maybe something to iterate towards, but we are constrained by fairly deep implementation assumptions.

but rather which parachains should be validated and at what frequency divisor.

Actually this approach does exactly that - it just goes a little further and constrains which core a parachain can be validated on. But note that parachains can have multiple regions corresponding to many different cores, and their effective rate is the sum of all their regions' rates. And cores can have an arbitrary number of regions assigned to them, as long as the block production rate never exceeds 1 across all of them.
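The two rate rules in this reply can be sketched as follows. This is a minimal illustration, assuming each region stores its rate as a numerator over a fixed `RATE_DENOMINATOR` (a name used elsewhere in this thread). The value `1_000`, the `Region` fields shown, and both helper functions are hypothetical and not taken from the RFC text.

```rust
/// Hypothetical fixed denominator for region rates; the RFC's real value may differ.
const RATE_DENOMINATOR: u32 = 1_000;

/// Hypothetical, trimmed-down region: only the fields needed for rate accounting.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Region {
    para: u32,           // assignee ParaId
    core: u32,           // core index the region is anchored to
    rate_numerator: u32, // parachain blocks per relay-chain block, over RATE_DENOMINATOR
}

/// Effective rate of a parachain: the sum of the rates of all its regions,
/// regardless of which cores those regions sit on.
fn para_rate(regions: &[Region], para: u32) -> u32 {
    regions
        .iter()
        .filter(|r| r.para == para)
        .map(|r| r.rate_numerator)
        .sum()
}

/// A core is validly scheduled if the rates of its regions sum to at most 1,
/// i.e. at most RATE_DENOMINATOR in numerator units.
fn core_within_capacity(regions: &[Region], core: u32) -> bool {
    let total: u32 = regions
        .iter()
        .filter(|r| r.core == core)
        .map(|r| r.rate_numerator)
        .sum();
    total <= RATE_DENOMINATOR
}
```

With one full region on core 0 and a half-rate region on core 1, a parachain's effective rate under this sketch is 1.5 blocks per relay-chain block, while core 1 can still host another half-rate region for a different chain.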

On the actual validator implementation side, I believe the code will be much more parachain-focused than core-focused. Cores only come into question for the relay-chain block author, when selecting parachain blocks on regions that are ready, just to make sure that it's not posting more than the allowed amount of parachain blocks per core. The rest happens asynchronously.

@gavofyork
Contributor

Could you explain why there's a need to introduce an anonymous identifier (RegionId) and correspondingly indexing scheduling information in terms of this identifier rather than simply indexing things in terms of a low-level period and the core index? This necessitates the maintenance of a reverse lookup and in general anonymous IDs are a bit horrible.

If PoVs were offered based on a CoreId rather than RegionId wouldn't this provide adequate information to ensure Parachains were not over-scheduled?

If we distill things a little (assume maximum == 0 and avoid the indirection of RegionSchema) we end up with RegionId mapping to a Para assignee, the core on which it will run, the blocks it should be taking on that core (start, duration and stride) and a variable, count, to track its usage so far. Here, RegionId is not in any way meaningful - it's an anonymous key, instances of which must be tracked in some way to be able to place one in a PoV. This anonymous identifier is exposed all the way from the high-level brokerage system chain down to the PoV of the collator. Yet for any given (start, duration, core) triplet, I would expect there to be only one ParaId, which would seem to indicate that a self-describing key including the core index and a period of time would do just as well.

@rphmeier
Contributor Author

rphmeier commented Jul 3, 2023

Could you explain why there's a need to introduce an anonymous identifier (RegionId)

If regions are ever made mutable, i.e. are possible to split by duration or decompose by frequency, then having stable IDs could be important on the node-side. It's less necessary in an immutable paradigm, but I'd bet on needing some limited amount of mutability in the future.

The only property we really need from Region IDs is uniqueness (within a relay-chain). I don't think the format of it changes the implementation much, although self-describing does help reduce storage proof overheads on paras. If region IDs are passed around on the network, nodes still need to check with the runtime APIs that the region exists and which para it is assigned to.

This necessitates the maintenance of a reverse lookup and in general anonymous IDs are a bit horrible.

Could you elaborate? The first reverse lookup is ParaId, RegionId => () which we need to maintain for collators to quickly index and prove the regions belonging to their parachain. That'd be the same regardless of the region ID format. With an anonymous region ID we do have to look up the full region in the state proof as well, but I think we'd need to do that anyway, since the ID wouldn't be fully self-describing.

The second reverse lookup is RegionId => Region which we need on validators to check the state of regions over time and on network events. All the information in the Region structure is needed on the node-side, so what RegionId is doesn't matter much here as long as it's unique.
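The two lookups described above can be sketched with ordinary ordered maps. This is only an illustration of the shape of the data: `RegionStore`, its methods, the `u64` region ID and the trimmed `Region` fields are hypothetical stand-ins for the runtime storage items, not the RFC's definitions.

```rust
use std::collections::{BTreeMap, BTreeSet};

type ParaId = u32;
/// Hypothetical ID type: the only property relied on here is uniqueness.
type RegionId = u64;

/// Hypothetical, trimmed-down region record.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Region {
    assignee: ParaId,
    core: u32,
    start: u32,
    duration: u32,
}

#[derive(Default)]
struct RegionStore {
    /// RegionId => Region: used by validators to check region state.
    regions: BTreeMap<RegionId, Region>,
    /// (ParaId, RegionId) => (): used by collators to index their regions.
    by_para: BTreeSet<(ParaId, RegionId)>,
}

impl RegionStore {
    /// Inserts a region under a fresh ID; returns false if the ID is taken.
    fn insert(&mut self, id: RegionId, region: Region) -> bool {
        if self.regions.contains_key(&id) {
            return false;
        }
        self.by_para.insert((region.assignee, id));
        self.regions.insert(id, region);
        true
    }

    /// All region IDs owned by `para`, found by a prefix range scan.
    fn regions_of(&self, para: ParaId) -> Vec<RegionId> {
        self.by_para
            .range((para, RegionId::MIN)..=(para, RegionId::MAX))
            .map(|&(_, id)| id)
            .collect()
    }
}
```

Because the key is only required to be unique, two regions with identical contents can coexist under different IDs in this sketch.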

Yet for any given (start, duration, core) triplet, I would expect there to be only one ParaId

Regions' data or schemas aren't guaranteed to be unique at all. In fact, one ParaId could own multiple identical regions. Core-sharing between two chains, for instance, is easily implemented as two regions, with the same (start, duration, core) triplet and a rate_numerator of RATE_DENOMINATOR/2 (i.e. block production rate of 1/2). In this proposal we don't attempt to fix the specific relay-chain blocks that parachains get to post blocks within, just the average rate and maximum over a period of time.

@gavofyork
Contributor

gavofyork commented Jul 3, 2023

If regions are ever made mutable, i.e. are possible to split by duration or decompose by frequency, then having stable IDs could be important on the node-side. It's less necessary in an immutable paradigm, but I'd bet on needing some limited amount of mutability in the future.

Surely "mutability" in this context would be better termed "reversion of allocation" and I don't see why we would ever want to support it. You can trade and move the bulk Coretime you buy but once you commit to an allocation of that Coretime, I don't see why we would want to add substantial complexity onto not only the Relay-chain but also the Broker chain to be able to revert that commitment.

@gavofyork
Contributor

Regarding "reverse lookup" I was referring to there being two distinct maps providing perspectives into the same underlying relation. In this case RegionId is mapped to a struct including an assignee, but there is also a map from assignee to RegionId (for iteration).

@gavofyork
Contributor

Regions' data or schemas aren't guaranteed to be unique at all. In fact, one ParaId could own multiple identical regions. Core-sharing between two chains, for instance, is easily implemented as two regions, with the same (start, duration, core) triplet and a rate_numerator of RATE_DENOMINATOR/2 (i.e. block production rate of 1/2). In this proposal we don't attempt to fix the specific relay-chain blocks that parachains get to post blocks within, just the average rate and maximum over a period of time.

What I said does not exclude this.

For example, if between times T and T', both cores #1 and #2 were to be shared equally between two paras 1000 and 1001, this could equally be expressed by two schemas:

(Core(1), Timeslice(T)) -> vec[1000, 1001],
(Core(2), Timeslice(T)) -> vec[1000, 1001],
(Core(1), Timeslice(T')) -> vec[],
(Core(2), Timeslice(T')) -> vec[],

and

AnonymousKey1 -> (Core(1), Para(1000), T, T', 1/2)
AnonymousKey2 -> (Core(1), Para(1001), T, T', 1/2)
AnonymousKey3 -> (Core(2), Para(1000), T, T', 1/2)
AnonymousKey4 -> (Core(2), Para(1001), T, T', 1/2)

1000 -> AnonymousKey1 -> ()
1000 -> AnonymousKey3 -> ()
1001 -> AnonymousKey2 -> ()
1001 -> AnonymousKey4 -> ()

In the top schema, the keys are distinct even though the region usage is the same for two different cores. You don't need to introduce the AnonymousKeys (in the second schema). Obviously without AnonymousKey the whole thing is rather a lot simpler.
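A minimal sketch of the top schema, for comparison. It assumes that an entry stays in force until the next entry for the same core, which is how the example reads (the empty vector at T' ends the assignment). The `Schedule` alias and `assignees_at` are hypothetical names.

```rust
use std::collections::BTreeMap;

type CoreId = u32;
type Timeslice = u32;
type ParaId = u32;

/// (Core, Timeslice) -> paras sharing the core from that timeslice onwards.
type Schedule = BTreeMap<(CoreId, Timeslice), Vec<ParaId>>;

/// Paras assigned to `core` at timeslice `now`: the most recent entry for
/// that core at or before `now`, or nothing if no entry exists yet.
fn assignees_at(schedule: &Schedule, core: CoreId, now: Timeslice) -> &[ParaId] {
    schedule
        .range((core, Timeslice::MIN)..=(core, now))
        .next_back()
        .map(|(_, paras)| paras.as_slice())
        .unwrap_or(&[])
}
```

The key is self-describing, so no separate identifier has to be tracked, but finding everything scheduled for one para means scanning entries across cores.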

@gavofyork
Contributor

Low-latency instantaneous Coretime scheduling isn't especially well expressible in the timeslicing schema, since the timeslices are assumed to be of the order of 100 blocks. However, I don't think the Broker chain could handle any low-latency scheduling, regardless of the schema since at high usage it would be too much information to be sensibly aggregating and passing through UMP. Instantaneous Coretime is a fundamentally different beast to Bulk Coretime - it cannot be turned into an NFT, split or composed. Because of this, it's surely best to handle directly on the Relay-chain.

@rphmeier
Contributor Author

rphmeier commented Jul 3, 2023

As an aside, breaking things down into timeslices isn't something I see as a requirement on the relay-chain. They are useful for higher-level logic as a simplification and packaging of coretime, but the relay-chain is more flexible when able to handle things at block-level granularity.

For the bulk of this proposal, breaking things down by timeslice is not a significant factor, and I'd be open to changing things on that basis. Note, though, that timeslices are only unique within periods of 28 * DAYS.

For example, if between times T and T', both cores #1 and #2 were to be shared equally between two paras 1000 and 1001, this could equally be expressed by two schemas:
(Core(1), Timeslice(T)) -> vec[1000, 1001],
(Core(2), Timeslice(T)) -> vec[1000, 1001],

This mapping does not scale very well as-is, in the sense that it would require either
a) relay-chain runtime logic for creating/cleaning up these storage entries for upcoming and expired timeslices - this is counter to goal (1) of this RFC, which is to push scheduling overhead onto nodes and keep it out of the runtime
b) tens of thousands of simultaneous storage items on the relay chain to be created eagerly
c) constant blitting of upcoming timeslice information by the broker chain onto the relay chain

The regions solution is also designed to accommodate finer-grained scheduling. A simple Vec<ParaId> only accommodates equal splits of coretime, whereas unequal splits are likely to be useful in the future. A simple use-case would be a parachain deciding to sell off 1/4 of its coretime for the next day, due to low demand, or likewise having a pallet on a parachain which is willing to reimburse buyers for all regions of arbitrary size, so long as blocks remain sufficiently full. That's not where we are today, but making decisions to enable such possibilities is quite important in my view.

The primary reason to have a mapping between ParaIds and assigned regions (with sufficiently unique IDs) is to enable elastic scaling. I believe that any proposal which does not allow a simple lookup from ParaId to all near-future scheduling information across all cores is a non-starter in the implementation of elastic scaling and would need to be reworked at that point, which I would very much like to avoid. What the ID is for that purpose is fairly immaterial, but uniqueness is quite important. A mapping from (CoreId, Timeslice) -> ParaId would require iterating all cores for the timeslice just for nodes to determine this information. Spending relay-chain execution time in computing this mapping is also wasteful.
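The lookup-cost argument can be made concrete with a small sketch. Both functions and both map layouts are hypothetical. The first mirrors a `(CoreId, Timeslice)`-keyed schedule, the second a para-indexed one.

```rust
use std::collections::BTreeMap;

type CoreId = u32;
type Timeslice = u32;
type ParaId = u32;
type RegionId = u64;

/// Core-keyed schedule: finding one para's cores for timeslice `t`
/// visits every entry, i.e. every core.
fn cores_for_para_scan(
    by_core: &BTreeMap<(CoreId, Timeslice), Vec<ParaId>>,
    para: ParaId,
    t: Timeslice,
) -> Vec<CoreId> {
    by_core
        .iter()
        .filter(|(key, paras)| key.1 == t && paras.contains(&para))
        .map(|(key, _)| key.0)
        .collect()
}

/// Para-indexed schedule: one lookup yields all of the para's regions
/// and the cores they are anchored to.
fn cores_for_para_indexed(
    by_para: &BTreeMap<ParaId, Vec<(RegionId, CoreId)>>,
    para: ParaId,
) -> Vec<CoreId> {
    by_para
        .get(&para)
        .map(|regions| regions.iter().map(|&(_, core)| core).collect::<Vec<_>>())
        .unwrap_or_default()
}
```

Both return the same answer. The difference is that the scan grows with the number of cores, while the indexed lookup grows only with the number of regions the para owns.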

We should have a single scheduling mechanism on the relay chain that handles all allocations of coretime, regardless of which mechanisms or chains create and manage those allocations - for the tens of thousands of on-demand claims along with longer-term scheduled paras. If not, we would need to introduce a separate scheduling mechanism for on-demand, and node and runtime implementers would need to maintain two parallel implementations of scheduling tracking. Given the already substantial complexity of Polkadot, I believe this is best avoided by unifying both and this will pay dividends in implementation time and maintenance overhead.

Note that this proposal's format need only be used in the implementation of allocate on a broker-chain, which would transform and blit a region, and could even be abstracted away from the broker entirely via XCM. i.e. if the broker is enshrined, the relay chain can accept blitting information in whatever format the broker wants and transform it under the hood.

There are other goals like making up for missed opportunities in a timely way that aren't met by the above mapping but that's all described in the RFC text.

@eskimor

eskimor commented Jul 3, 2023

A simple Vec<ParaId> only accommodates equal splits of coretime

This does not seem to be true, e.g.: [A,A,A,B], would give A 3x the time it gives to B.

@rphmeier
Contributor Author

rphmeier commented Jul 4, 2023

You can do that, but it doesn't scale very well beyond very simple fractions. It also doesn't accommodate situations where you've split a region into e.g. 10 parts and only 1/10 is currently allocated while the other 9/10 are being traded and haven't yet been allocated.
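To illustrate both sides of this exchange: a repeated-entry vector does encode unequal shares, but the vector length needed grows with the least common multiple of the share denominators. The helper names below are hypothetical, and shares are assumed to have non-zero denominators.

```rust
/// Greatest common divisor (Euclid's algorithm).
fn gcd(a: u32, b: u32) -> u32 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Share of a core that `para` receives under a rotation such as [A, A, A, B],
/// returned as (occurrences, rotation length).
fn share(rotation: &[u32], para: u32) -> (usize, usize) {
    let hits = rotation.iter().filter(|&&p| p == para).count();
    (hits, rotation.len())
}

/// Shortest rotation that can express every (numerator, denominator) share
/// exactly: the least common multiple of the reduced denominators.
/// Assumes every denominator is non-zero.
fn min_rotation_len(shares: &[(u32, u32)]) -> u32 {
    shares.iter().fold(1, |len, &(num, den)| {
        let reduced = den / gcd(num, den);
        len / gcd(len, reduced) * reduced
    })
}
```

For example, `[A, A, A, B]` gives A a 3/4 share with four entries, while shares of 1/3 and 2/7 on the same core would need at least 21 entries. A rotation would also need some placeholder for coretime that has been split off but not yet allocated, which is an assumption about how such a design might cope rather than something proposed in this thread.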

@rphmeier rphmeier force-pushed the rh-blockspace-regions branch from 6de9bf1 to 0d88c3c July 25, 2023 16:56
@rphmeier rphmeier changed the title from "Blockspace Regions" to "Coretime Scheduling Regions" Jul 28, 2023
// Relay-chain block number at which this region becomes active.
start: u32,
// The end point of the region in relay-chain blocks, i.e. it ends at start + duration. Note that endpoints are flexible up to `SCHEDULING_LENIENCE`.
// This may be `None`, in the case that the true endpoint is only determined later.
Contributor

In which cases the endpoint is determined later ?

Contributor Author

@rphmeier rphmeier Jul 31, 2023

Via RFC-5 (#5) - the interface can be used more efficiently by not specifying the endpoint.

On the RFC-1 broker chain (or any other marketplace using that interface), you could imagine a region owner pre-emptively splitting up its coretime into e.g. 10 minute chunks, and assigning each shortly before the deadline. The None endpoint makes it possible to avoid sending a message for each chunk, as long as the assignment stays the same as it was before.
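A rough way to see the saving, under a simplifying assumption that is not stated in the thread: with fixed endpoints every chunk costs one message, while with an open (`None`) endpoint only the first assignment and each later change of assignee cost one message. Both function names are hypothetical.

```rust
/// Messages needed when every chunk carries an explicit endpoint:
/// one per chunk, even if the assignee never changes.
fn messages_with_fixed_endpoints(chunk_assignees: &[u32]) -> usize {
    chunk_assignees.len()
}

/// Messages needed when the endpoint may be left open (`None`):
/// one for the first chunk, plus one each time the assignee changes.
/// Assumes a change of assignee costs exactly one message.
fn messages_with_open_endpoint(chunk_assignees: &[u32]) -> usize {
    if chunk_assignees.is_empty() {
        return 0;
    }
    1 + chunk_assignees.windows(2).filter(|w| w[0] != w[1]).count()
}
```

For six consecutive chunks assigned `[1000, 1000, 1000, 1001, 1001, 1001]`, this model counts six messages with fixed endpoints and two with an open endpoint.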

text/0003-blockspace-regions.md (outdated)

The scheduling lenience allows regions to fall behind their expected tickrate, limited to a small maximum level of debt. This prevents core debt from being accumulated indefinitely and spent when convenient. Smoothing system load over short time horizons is desirable, but over infinite time horizons becomes dangerous.
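One way to picture bounded debt is a token bucket. The sketch below is an illustration of the capping idea only, not the RFC's actual accounting: `RegionCredit`, its methods, the value of `RATE_DENOMINATOR`, and the `max_debt_blocks` parameter are all hypothetical.

```rust
/// Hypothetical fixed denominator for region rates.
const RATE_DENOMINATOR: u64 = 1_000;

/// Unspent block-production credit of one region, in units of
/// 1 / RATE_DENOMINATOR of a parachain block.
#[derive(Default)]
struct RegionCredit {
    credit: u64,
}

impl RegionCredit {
    /// Called once per relay-chain block: accrue credit at the region's rate,
    /// but never beyond `max_debt_blocks` whole blocks. Credit above the cap
    /// is forfeited, so it cannot be hoarded and spent later.
    fn tick(&mut self, rate_numerator: u64, max_debt_blocks: u64) {
        let cap = max_debt_blocks * RATE_DENOMINATOR;
        self.credit = (self.credit + rate_numerator).min(cap);
    }

    /// Try to post one parachain block; succeeds only if a whole block of
    /// credit is available.
    fn try_post(&mut self) -> bool {
        if self.credit >= RATE_DENOMINATOR {
            self.credit -= RATE_DENOMINATOR;
            true
        } else {
            false
        }
    }
}
```

A region at rate 1/2 that stays idle for ten relay-chain blocks with a cap of two can catch up by at most two blocks, after which it is back to one block every two relay-chain blocks.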

This RFC introduces a new `HYPERCORES` parameter into the `HostConfiguration` which relay-chain governance uses to manage the parameters of the parachains protocol. Hypercores are inspired by technologies such as hyperthreading, to emulate multiple logical cores on a single physical core as resources permit. Hypercores allow parachains to make up for missed scheduling opportunities, which is important to effectively decouple parachain growth from backing on the relay chain.
Contributor

@sandreim sandreim Jul 31, 2023

Why HYPERCORES instead of simply some separate burstable or reserved cores?
Parachains can only be assigned such a core when they need to make up for missed scheduling opportunities.
I think it to be a simpler way of solving (3).

Also, I am not a fan of hyperthreading tech; it's known for its security flaws, and the current recommendation is to disable it.

WDYM by decouple parachain growth from backing on the relay chain, AFAIK all parachain blocks should be backed.

Contributor Author

I'm OK with calling them "burst" cores or something like it.

WDYM by decouple parachain growth from backing on the relay chain, AFAIK all parachain blocks should be backed.

Specifically it refers to avoiding scheduling friction between multiple chains sharing the same core. Burst cores are a way to prevent explicit friction by trading off temporary system performance. This is much better than trying to account for that friction, which would need to cascade all the way into validators' and collators' logic. i.e. we decouple the growth of independent parachains sharing the same core.


### Changes to approval checking

Approval-checking is altered to support hypercores: it accommodates multiple blocks being made available on the same core at the same time, and samples selection for these blocks based on the number of regular cores.
Contributor

Seems to be only needed by the hypercore approach, but I expected something similar should be introduced by a flavour of what's discussed here: https://github.com/paritytech/polkadot/issues/7441 .

Contributor Author

@rphmeier rphmeier Jul 31, 2023

Yep (we should RFC that as well - I mentioned it in the Future work section)

@eskimor eskimor left a comment

More thoughts:

Integration with on-demand

If we wanted the scheduler to use regions even for on-demand, what would this look like?

We could build multiple regions over let's say 5 blocks, each using a 5th of the capacity. If there is low demand, we might only have e.g. 3 of such regions or 1, which is fine as long as they are allowed to produce candidates immediately. If more candidates trickle in, we can add more regions afterwards starting at later blocks.

So at each block we would check whether we have orders queued, if so, we would look at the current active regions on the on-demand cores and figure out which ones still have capacity to add more regions.

So we would be:

  1. Iterating over all on-demand cores.
  2. Collect the currently active regions on those cores, adding up the ratios to figure out whether capacity is left. If so, add more regions and remove the corresponding orders.

We would also need to prune dead regions each block.
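The per-block procedure described above can be sketched as follows. This is a reading of the comment rather than an existing implementation: `schedule_on_demand`, the trimmed `Region` struct and the value of `RATE_DENOMINATOR` are hypothetical, and each new region is assumed to last `duration` relay-chain blocks at a rate of 1/`duration`.

```rust
use std::collections::VecDeque;

/// Hypothetical fixed denominator for region rates.
const RATE_DENOMINATOR: u32 = 1_000;

type ParaId = u32;

/// Hypothetical, trimmed-down region record.
#[derive(Debug)]
struct Region {
    para: ParaId,
    core: u32,
    start: u32,
    duration: u32,
    rate_numerator: u32,
}

/// Run once per relay-chain block: prune expired regions, then turn queued
/// on-demand orders into new regions wherever a core has spare capacity.
fn schedule_on_demand(
    regions: &mut Vec<Region>,
    orders: &mut VecDeque<ParaId>,
    on_demand_cores: &[u32],
    now: u32,
    duration: u32,
) {
    if duration == 0 || duration > RATE_DENOMINATOR {
        return; // degenerate configuration; nothing sensible to schedule
    }
    // Prune dead regions.
    regions.retain(|r| now < r.start + r.duration);
    // Each new region uses 1/duration of a core's capacity.
    let slice_rate = RATE_DENOMINATOR / duration;
    for &core in on_demand_cores {
        // Add up the ratios of the regions currently on this core.
        let mut used: u32 = regions
            .iter()
            .filter(|r| r.core == core)
            .map(|r| r.rate_numerator)
            .sum();
        while used + slice_rate <= RATE_DENOMINATOR {
            match orders.pop_front() {
                Some(para) => {
                    regions.push(Region { para, core, start: now, duration, rate_numerator: slice_rate });
                    used += slice_rate;
                }
                None => return, // no orders left
            }
        }
    }
}
```

With one on-demand core, seven queued orders and `duration = 5`, the first call creates five regions and leaves two orders queued until the earlier regions expire.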

How it currently works in comparison:

  1. Iterate over all the cores.
  2. We iterate over the cores (claim queues) and pop from the assignment provider (the order queue in the case of on-demand) as many orders as we have free spots in the claim queue.

So this seems rather similar. Initially it seemed integrating with on-demand might be a bit cumbersome, but it seems fine now. I don't see any benefits when it comes to scheduling overhead though.

Exposing regions as a new primitive

I can imagine regions as depicted being useful for handling bulk assignments, especially for manipulation and markets, and also in scheduling itself. However, I am having trouble seeing the benefits of exposing the primitive to validators and collators, which is a major change in how things work.

Mapping the existing implementation to requirements

  1. I believe the scheduling overhead is already quite low, also we don't intend to handle tens of thousands of cores, but only parachains. The number of parachains existing does not really matter for the scheduling overhead. Inefficiencies which might exist should be fixable even without a new primitive. I don't think it would become a bottleneck any time soon. We have inefficient weights, we do lots of signature checking, we never really optimized any code and I have never seen scheduling to be an issue in performance.
  2. For a given relay chain parent you have x cores for para y. We just accept x candidates, which core we use for which should not matter. Processes on a computer also don't care on which cores they are run. For forks and such, I don't see how regions directly solve this problem. If we wanted to be explicit for any reason, we could equally well put the core index in the receipt.
  3. For actually missed opportunities another para can cover your spot and you can try again the next block - you are not put back at the end of a very long list. For availability timing out: We do have a configurable retry. There is no fundamental difference here between a core that is shared and one that is not.
  4. Indeed we are lacking primitives that can be sold on markets and scheduled. Regions as described here or RFC 1 would fit the bill. They can easily be integrated in the current design as an assignment provider.
  5. Very low latency is handled by on-demand. Low latency bulk is not available in the current implementation.
  6. Fully covered by the current implementation.
  7. Fully covered, they are not allowed to accumulate coretime at all.
  8. From my point of view, the current interface, based on cores, is already independent of how paras are scheduled. The scheduler itself also does not care about different ordering mechanisms. It only cares about assignments. How these assignments come into existence does not matter, whether it is regions as described here or in RFC 1, an on-demand order queue, or anything else.

1. **The solution MUST gracefully handle tens of thousands of parachains without significant runtime scheduling overhead.**
- To enable world-scale, highly agile coretime, as much scheduling overhead as possible should be pushed to validators' native implementations, or even better collators'.
2. **The solution MUST minimize the complexity of determining a mapping between upcoming blocks of a parachain when the parachain is scheduled on many Execution Cores simultaneously.**
- Without this, the validators assigned to those execution cores will have no way of determining which upcoming blocks they are responsible for, and will waste resources. This is a practical requirement for e.g. elastic scaling an application to 2, 8, or 16 simultaneous cores.

I still don't get this part. Validators validate whatever the collators send them, as long as their ParaId is assigned to the core they are responsible for right now. Which is unambiguous based on the relay parent of the candidate.

How are resources wasted?


Ok, actually if we go with core groups then it makes sense: in this case a single backing group would be responsible for multiple cores, and if more than one was assigned to the same para id, it would indeed not be determined onto which core a collation should be mapped. But does it matter? If the same para id were assigned to x cores, we would simply accept x candidates. Yes, our assumption about collators would need to change a bit, and PoW parachains might have a hard time using simultaneous cores, but I don't see how regions would help here directly:

Even without core groups, if collators create random forks and we back unrelated forks in different backing groups, we would sign statements and distribute them even though they cannot possibly both get backed. That is indeed stupid, but resolving this requires collators to not be stupid: I don't see a fundamental difference between collators being smart and sending sane collations to groups vs. them being smart, assigning the correct region id and sending to the correct group. Regions themselves would not prevent them at all from creating unrelated forks on two different regions.

In a nutshell, I still don't get the wasted effort part.

Contributor Author

@rphmeier rphmeier Aug 4, 2023

I think you're overcomplicating things with core groups.

Just take elastic scaling: Imagine a process using 8 cores. There are 8 backing groups theoretically assigned to the ParaId.

There needs to be some way to hint which parachain candidate is supposed to go on which core, so they don't all work on all 8 candidates. And it shouldn't depend on which nodes validators are communicating with.

Let's take a chain of 4 candidates [a b c d], all for a single parachain with 4 full cores: [1 2 3 4].

Validators can't work to back one candidate until they are at least aware of all the others before it.

I'll set down some goals:

Goal 1: optimistically, only one backing group should be validating one of [a b c d].
Goal 2: all backing groups must know each candidate
Goal 3: collators should not have to predict the state of each core when they create their blocks, for the simple reason that it's a moving target with incomplete information
Goal 4: we should tolerate inactive or malicious backing groups


Case 1:

validators validate whatever the collators send them
collators are smart and only send candidates to the groups they want to work on them

If we imagine that collators want Group 1 to work on A, so they only send A to Group 1, then a malicious or offline Group 1 can prevent Groups 2, 3, and 4 from learning about the candidate.

Goal 4 failed, Goal 2 failed as a result.


Case 2:

validators validate whatever the collators send them
collator nodes are careful and send all candidates to all the groups

or

some node posing as a collator sends all candidates to all the groups

Goal 1: failed


Case 3:

validators take hints from the CandidateReceipt or some PrePVF about the intended core for the candidate.
collators send all candidates to all validators

If we assume all backing groups [1, 2, 3, 4] are honest and online, we succeed.

If some backing group b is offline, e.g. [1, (), 3, 4], then 3 and 4 can step in for candidates B and C, respectively. They could do this as long as the hint is not a hard requirement, which it should not be. Partial success with minimal duplication of effort.


Case 4:

validators validate whatever the collators send them
collators are smart and only send candidates to the groups they want to work on them
no groups are offline

Some cores are occupied. Even though all validators got the candidates they were supposed to, the whole chain is held up by some other parachain. The candidates expire useless and the collators go back to the drawing board. From the parachain's perspective, this is effectively the same as a group being offline.

Goal 3 failed


Case 5:

validators validate whatever the collators send them
collators are smart and only send candidates to the groups they want to work on them
collators can't be impersonated, and only candidate authoring nodes can be collators

Validator group [1] is offline, so [2 3 4] just do nothing for a while but luckily the collators are really smart and notice this. They adjust and send [a b c] to [2 3 4].

In this case, things work, but there are some really deep assumptions about which nodes collators are. We're also pushing all the scheduling work completely to collators, relying on them to pick up on when backing validators are offline or censoring them.


Case 6:

validators don't validate whatever the collators send them, but exercise some caution
collators are stupid or malicious and send all candidates to all validators

What's the algorithm for "exercise some caution" here?

Well, we might imagine that validators construct some mapping between cores and the parachain blocks that are pending. They all take the same view of the relay chain, and look at claim queues for each chain.

They come to a mapping of [a: 2, b: 3, c: 1, d: 4] and each validate their respective block.
Unfortunately, the cores don't get freed up as predicted and this mapping turned out to be totally inefficient. What should they do? Stick with the existing mapping they calculated? Or try to calculate a new one?

We must have some proposal for what the algorithm is, and this proposal is my answer to that question, and finding the pitfalls in various algorithm definitions.


Contributor Author

@rphmeier rphmeier Aug 4, 2023

In a nutshell, I still don't get the wasted effort part.
sending sane collations to groups

Should all 8 backing groups validate all 8 pending candidates every 6 seconds? i.e. should they all be wasting their resources doing the same work as all the other backing groups?

Are you assuming that collators are the same nodes who author blocks? Why? I don't believe this is a useful assumption to make.

resolving this requires collators to not be stupid

No, it requires collator authentication and giving untrusted collators the ability to instruct validators to do anything. Otherwise any node could just send the candidate to all the backing groups and waste their resources.

The parachain consensus itself should be able to hint which core a particular candidate is intended for. After all, it had to be aware of all the cores assigned to the parachain when doing authoring. It should be unambiguous regardless of how backers receive potential candidates.


If some backing group b is offline, e.g. [1, (), 3, 4], then 3 and 4 can step in for candidates B and C, respectively.

How would they learn about the offline backing group and how would they coordinate on who is going to work on what? All solvable, I guess, but also overcomplicating. Backing groups are already a redundant setup: if we experienced issues with whole backing groups being malicious or offline we could always counteract by increasing the backing group size and thus increasing redundancy. With time disputes we would even be able to reduce the threshold to 1, meaning only one out of 5 needs to be live and honest for the backing group to be live and honest. I think this is already plenty of redundancy.

Some cores are occupied. Even though all validators got the candidates they were supposed to, the whole chain is held up by some other parachain.

I still think that it does not matter whether this is some other parachain or the same parachain, the issue is fundamentally the same: You miss out on an opportunity to author a block. Availability does not really care which parachain a candidate belongs to (I argued below that this is likely even true with a malicious backing group). Also to be clear, this is an argument for the burst part - correct?

They come to a mapping of [a: 2, b: 3, c: 1, d: 4] and each validate their respective block.
Unfortunately, the cores don't get freed up as predicted and this mapping turned out to be totally inefficient. What should they do? Stick with the existing mapping they calculated? Or try to calculate a new one?

Also fixed by burst - right?

No, it requires collator authentication and giving untrusted collators the ability to instruct validators to do anything. Otherwise any node could just send the candidate to all the backing groups and waste their resources.

The parachain consensus itself should be able to hint which core a particular candidate is intended for. After all, it had to be aware of all the cores assigned to the parachain when doing authoring. It should be unambiguous regardless of how backers receive potential candidates.

💡 Of course you are totally right. Ok this makes perfect sense now, thanks! If the core mapping is the result of the parachain consensus, this indeed helps, but it is still problematic: we only know whether the reported core is correct after doing the validation. A malicious actor could therefore still mess with us severely and waste backers' resources. We would at least not send out statements, making it less likely that other validators in the same group pick up the same wasted effort. The backers will also know immediately that they have been messed with and can, for example, reduce that peer's reputation. So yes, agreed, we need that (or something like it).

The parachain consensus itself should be able to hint which core a particular candidate is intended for. After all, it had to be aware of all the cores assigned to the parachain when doing authoring. It should be unambiguous regardless of how backers receive potential candidates.

Now I am finally getting where this goal is coming from. 🥇

So I think I am getting now where the burst requirement is coming from and I understand why we would like to have a core id or a region id in the candidate receipt. I am proposing an alternative to burst cores below, which I think better matches our actual intention and avoids the potential drawbacks. What I am not sure about yet is whether we need a region id or whether a core id is enough. In the end it is a breaking change either way. You are arguing that a region id would be better as we then can have a direct mapping ParaId -> RegionIds, while with cores we could only have (Relay parent hash, ParaId) -> CoreIds. I would argue that for regions it should also be (relay parent hash, ParaId) -> RegionId as new regions might come into existence each block. For the claim queue approach the returned CoreIds would be all the cores where we have an upcoming assignment. We can, if we want to, actually completely ignore the ordering in the claim queue, because even if all paras would immediately provide their collations, all of them could be backed (assuming timely availability). This is because the claim queue is no longer than max depth in async backing. This is something I am missing in regions: They are not limited in size and have no ordering. Not sure how this is supposed to work?

- To enable world-scale, highly agile coretime, as much scheduling overhead as possible should be pushed to validators' native implementations, or even better collators'.
2. **The solution MUST minimize the complexity of determining a mapping between upcoming blocks of a parachain when the parachain is scheduled on many Execution Cores simultaneously.**
- Without this, the validators assigned to those execution cores will have no way of determining which upcoming blocks they are responsible for, and will waste resources. This is a practical requirement for e.g. elastic scaling an application to 2, 8, or 16 simultaneous cores.
3. **The solution SHOULD minimize the time for applications to make up for missed scheduling opportunities even when cores are highly shared**.

With async backing, missing an opportunity should already be way harder than it is now. In addition, with async backing we have more flexibility: e.g. if we have para ids [A,B] scheduled, we would prefer to back A if available, but if not, backing B would also be acceptable, which is better than idling. Then when it is B's turn we could back A if it is available by now, essentially reversing the order to [B,A].

In the claim queue design, validators would be incentivized to adhere to the order if possible, but if that is not possible they would still get more rewards for doing [B,A], instead of doing [_,B]. Where _ means doing nothing. This does not fully alleviate MEV, but at least the MEV would need to be greater than what the validator loses by flipping the order.
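
The fallback described above (prefer A, but back B rather than idle) can be sketched roughly as follows. This is only an illustration under assumptions: `ParaId`, `next_to_back` and the `is_available` callback are hypothetical names, not part of any existing scheduler API, and rewards are not modelled.

```rust
use std::collections::VecDeque;

/// Hypothetical para identifier, used only for this illustration.
type ParaId = u32;

/// Removes and returns the first para in the claim queue that has a candidate
/// ready to back. Paras that were skipped stay at the front of the queue, so
/// [A, B] effectively becomes [B, A] when only B is ready.
/// Returns `None` (the core idles) when nothing in the queue is available.
fn next_to_back(
    queue: &mut VecDeque<ParaId>,
    is_available: impl Fn(ParaId) -> bool,
) -> Option<ParaId> {
    // Assumption: the queue is short (around the async backing max depth),
    // so a linear scan is cheap.
    let pos = queue.iter().position(|&para| is_available(para))?;
    queue.remove(pos)
}
```

With a queue of `[A, B]` where only B has a collation ready, the call returns B and leaves `[A]` for the next relay-chain block.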

The overhead in the claim queue design is indeed a bit higher as we need to maintain the queue. Although the queue that needs to be manipulated is quite small: Around the max depth we allow in async backing. While just updating counters is surely less effort, not sure it matters at this scale, especially when comparing this to things like needed signature checks for backing statements.

Contributor Author

@rphmeier rphmeier Aug 4, 2023

With async backing, missing an opportunity should already be way harder than it is now.

And with elastic scaling, the negative effects of missing an opportunity grow significantly.

Contributor Author

@rphmeier rphmeier Aug 4, 2023

In the claim queue design, validators would be incentivized to adhere to the order if possible,

Why is order important? How does it anchor to a key goal of the system? What is the edge of an order-based architecture over other architectures?

Why is order important? How does it anchor to a key goal of the system? What is the edge of an order-based architecture over other architectures?

  1. You have been concerned about MEV (based on ordering) and named it as one reason why a para might miss an opportunity. Having some pre-defined ordering could help make this harder.
  2. It allows bulk region buyers to be more explicit in how they want to share their core. E.g. [A, A, B, B], vs. [A, B, A, B], vs. ...
  3. As mentioned below (algorithm for determining whether a region is in surplus), I am not sure at the moment how things work at all if we don't have some definition of order.

And with elastic scaling, the negative effects of missing an opportunity grow significantly.

Because if only one candidate is missing availability, all the candidates in the next relay chain block are blocked? Or is there more?

In any case, I agree this sucks. Bundling them all up might help a bit, but does not solve the real issue you are highlighting here, which is: We have a very high profile para, so high profile that it is actually using multiple cores per relay chain block. Or in other words: "This is a premium customer! A premium customer with very high QoS requirements!"

A premium customer? Let's charge him premium! I am hereby proposing an alternative to burst cores: Premium Cores! Instead of blindly allowing any para to take a burst core (even those which don't actually care, but would potentially then affect service to others!), we split cores into cores with different QoS standards. Let's make them two for now: premium and standard. We then make it so that a premium core can consume existing standard cores if it missed an opportunity! Instead of oversubscribing or always under utilizing, premium cores just take precedence and are allowed to consume resources of standard cores (replacing their original assignment) if need be.

A three tier approach might be even more sensible:

  • Premium: Makes up for missed opportunities by displacing low-cost cores.
  • Best-effort: Does not make up for missed opportunities, but is never displaced.
  • Low-cost: Does not make up for missed opportunities and can be displaced.
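
As a rough sketch of the three tiers listed above, the two properties that distinguish them could be modelled like this. The type and function names (`CoreTier`, `makes_up_missed`, `can_be_displaced`, `find_displaceable`) are made up for illustration and do not exist in any implementation.

```rust
/// The three proposed QoS tiers. Names are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CoreTier {
    Premium,
    BestEffort,
    LowCost,
}

impl CoreTier {
    /// Whether a core of this tier may make up for a missed opportunity.
    fn makes_up_missed(self) -> bool {
        matches!(self, CoreTier::Premium)
    }

    /// Whether a core of this tier may have its assignment replaced.
    fn can_be_displaced(self) -> bool {
        matches!(self, CoreTier::LowCost)
    }
}

/// Index of the first core a premium para could take over after missing an
/// opportunity, or `None` if no displaceable core exists.
fn find_displaceable(cores: &[CoreTier]) -> Option<usize> {
    cores.iter().position(|tier| tier.can_be_displaced())
}
```

Under this sketch a best-effort core is the only tier that is unaffected in both directions: it neither catches up nor gets displaced.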

Benefits:

  • Users can directly state their intent and requirements, and we can cater to them. No one size fits them all.
  • We cater both to entry level (low cost) and enterprise with high QoS requirements.
  • Covering for a missed opportunity causes neither more work nor bursts on the network.

Our value proposition to cater to the full spectrum, from tinkerers up to heavy duty, high load enterprise just became better. We could even prioritize premium cores in availability to reduce the risk of a missed opportunity in the first place.


Coretime regions are a dynamic, multi-purpose mechanism for determining the assignment of parachains to a relay chain's Execution Cores. They replace existing scheduling logic in the Polkadot Relay Chain, and introduce a notion of BURST_CORES for accepting parachain blocks at flexible points in time.

Each region is a data structure assigned to a particular application, which indicates future rights to consume coretime and keeps records of how many resources have been consumed so far. Relay-chain coretime regions, unlike those in RFC-0001, are not intended to be traded or manipulated by users directly, but instead allocated by higher-level mechanisms. The details of those higher-level mechanisms are out of scope for this RFC. Parachains can own an arbitrary number of regions and are limited in block production only by the number and production rate of regions which they own. Each region belongs to a specific execution core, and cannot move between cores. Regions are referenceable by a unique 256-bit identifier within the relay-chain.

Relay-chain coretime regions, unlike those in RFC-0001, are not intended to be traded or manipulated by users directly, but instead allocated by higher-level mechanisms.

You lost me here. I thought that this was one of the design goals, to be able to track regions/allocations from order to execution.

Contributor Author

Originally, but that part ended up in RFC-1. Doing trading and splitting is too heavy for the relay chain itself.

@eskimor eskimor Aug 8, 2023

Still, using different definitions of regions seems off. That might be fine, if one is better suited for markets and one is better for scheduling, but so far, not even the ids would match. Maybe a reason to not yet vote on RFC-1?

6. **The solution MUST NOT allow chains to access more than their scheduled amount of Coretime.**
7. **The solution MUST NOT allow parachains to build up arbitrary amounts of Coretime to spend later on**
- The intention of regions is to ensure consistent rates of utilization by scheduled parachains. Allowing arbitrary amounts of Coretime to be built up and spent later will lead to misallocation of system resources during periods of high demand. Eliminating this type of arbitrage is necessary.
8. **The solution SHOULD unify all scheduling mechanisms on the relay-chain**. Maintaining multiple parallel interfaces and implementations of scheduling on the relay-chain, runtime APIs, and node-side will contribute enormously to implementation and ongoing maintenance overheads and should be avoided.

This is already achieved by the current design. The scheduler does not care how assignments are created, and the paraid -> coreid mapping exposed to validators and collators is also completely independent of how those assignments came into existence.

Contributor Author

Yes, it just doesn't solve all the other requirements

// This may be `None`, in the case that the true endpoint is only determined later.
end: Option<u32>,
// The maximum amount of per-relay-chain block core resources which may be
// used by this region, expressed in parts of `RATE_DENOMINATOR`.

I don't understand this comment. How does this relate to rate? Per-relay-chain block core resource? A region is per core, for a single relay-chain block it can either use its core or not.

Became clear further down the road.

Contributor Author

Setting the stage for core bundling, e.g. https://github.com/paritytech/polkadot/issues/7441 where we post 3 parachain blocks that each use 1/3 of a core.

💡 so this is really for this very relay chain block? In that case ignore the comments below, where I try to make sense of this. To recap:

  • rate: how much of the region's resources we are allowed to use.
  • maximum: how much of the current block's core resources we are allowed to use. This only makes sense with core groups, as otherwise any value below 100% would stall the process.

So basically with maximum we configure how the core sharing works: Alternating 100% usage or 50% usage each block (assuming two paras sharing the core).

Regions are used to modify the behavior of the parachain backing/availability pipeline. The first major change is that parachain candidates submitted to the relay-chain in the `ParasInherent` will be annotated with the `RegionId` that they are intended to occupy. Validators and collators do the work of figuring out which blocks are assigned to which regions, lifting the burden of granular scheduling off of the relay chain.

The **maximum consumption** of a region at any block number `now` is given by:
`maximum_consumption(now, region) = (min(now, end) - start) * rate`
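
A minimal Rust sketch of this formula, under stated assumptions: the `Region` struct below is a trimmed-down, hypothetical stand-in holding only the fields the formula needs, an open-ended region (`end == None`) is assumed to be capped by `now` alone, and a region that has not started yet is assumed to have accrued nothing.

```rust
/// Denominator for rates, as given in the parameters table.
const RATE_DENOMINATOR: u64 = 57_600;

/// Hypothetical, trimmed-down region with only the fields used here.
struct Region {
    start: u32,
    /// `None` when the true endpoint is only determined later.
    end: Option<u32>,
    /// Parts of `RATE_DENOMINATOR` accrued per relay-chain block.
    rate: u64,
}

/// maximum_consumption(now, region) = (min(now, end) - start) * rate
fn maximum_consumption(now: u32, region: &Region) -> u64 {
    // Assumption: with no known end, only `now` caps the elapsed time.
    let capped = region.end.map_or(now, |end| now.min(end));
    // Assumption: before `start` nothing has accrued (saturate at zero).
    u64::from(capped.saturating_sub(region.start)) * region.rate
}
```

For a region with `start = 100`, `end = Some(110)` and a rate of half a core (`RATE_DENOMINATOR / 2`), the value is 0 at block 100, grows by 28800 per block, and stops growing once `now` passes 110.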

This formula would technically allow for a region to still be relevant after end, e.g. to allow it to catch up on missed opportunities (effective consumption < maximum_consumption). Do we want lenience to go beyond end? If so, I would assume regions can be "garbage collected" after end + SCHEDULING_LENIENCE, correct?

Actually I think this does not work as intended. Let's assume we have x overlapping regions, each with rate 1/x. At the beginning of the regions, none of these paras would be allowed to author a block: min(now, end) would be now, and now would equal start, so the maximum consumption for each of these regions would be 0. All of them would need to wait x blocks in order to be allowed to back a candidate.

Contributor Author

They'd need to wait x blocks in order to back a candidate that consumes 100% of the core's resources. With other changes like execution groups they would be able to build a candidate that consumes 1/x of the core's resources on the first block.
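
To make the arithmetic concrete, here is a small sketch of how many relay-chain blocks a fresh region must accrue before a candidate of a given size fits. It ignores lenience and prior consumption, and the helper name `blocks_until_allowed` is hypothetical.

```rust
/// Denominator for rates, as given in the parameters table.
const RATE_DENOMINATOR: u64 = 57_600;

/// Blocks after `start` before a fresh region with `rate` has accrued at
/// least `p` parts, i.e. the smallest k with k * rate >= p.
/// Returns `None` for a zero rate, which never accrues anything.
fn blocks_until_allowed(p: u64, rate: u64) -> Option<u64> {
    if rate == 0 {
        return None;
    }
    // Ceiling division without relying on newer integer helpers.
    Some((p + rate - 1) / rate)
}
```

With x = 4 regions at rate `RATE_DENOMINATOR / 4` each, a candidate using the full core (57600 parts) fits after 4 blocks, while a candidate using 1/4 of the core (14400 parts) fits after 1 block.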

Contributor Author

@rphmeier rphmeier Aug 4, 2023

Do we want lenience to go beyond end?

This is probably necessary for fairness.

They'd need to wait x blocks in order to back a candidate that consumes 100% of the core's resources. With other changes like execution groups they would be able to build a candidate that consumes 1/x of the core's resources on the first block.

But how does this work then for 100% usage? I understood the region proposal as core sharing: e.g. you get the first relay chain block, I get the second, ... But with what you just said, this does not work: all processes would wait x blocks, none of them producing anything (resources are not used), and then after x blocks all of them would start producing blocks at the same time, over-allocating resources. What am I missing?

| Name | Constant | Value |
| ------------------- | -------- | --------- |
| RATE_DENOMINATOR | YES | 57600 |
| SCHEDULING_LENIENCE | NO | 16 |

16 seems excessive. I am assuming this is rather large, so a group rotation can happen in between? But even then 16 seems a lot and could lead to quite some burstiness, by parachains using hyper cores over multiple blocks in a row, while idling before. (They could do this on purpose.)

Contributor Author

@rphmeier rphmeier Aug 4, 2023

Why is burstiness a problem if it's limited? Average system load is predictable. What is likely to happen with this burstiness is that latency and finality lag would increase temporarily while validators catch up on work. But it's not unbounded. As the bursts are constrained, they would be followed by either
a) periods of expected load (no further "bursts" being built up). Running slightly below capacity means that validators can catch up
b) periods of low load ("bursts" being built up). These low-load periods will be easy to catch up during.

But if we increase latency and finality lag we are degrading the service (potentially for even more paras than were affected by some availability lag). This contradicts the original goal of maintaining good and predictable service.

I agree with your analysis, it would only be a temporary issue, but so would be availability issues (hopefully, otherwise we have different problems).

We also have to think about what would be causing availability issues: a malicious or malfunctioning backing group is only one scenario. A more likely scenario (as already experienced in practice) is some overload condition. Adding even more work in this situation is almost guaranteed to degrade the service even further.

I think high load is actually the most likely reason for slow availability even longer term, because backers are incentivized to make the candidate available. If they fail to do so, they lose out on rewards while already having invested all the work.

If a backing group wanted to censor some para, it would most likely do so even before availability, but they are still incentivized to back something. So they would very likely back some other eligible para, but then that para got served and its effective usage increased, therefore a future core just got free for that censored para.

In the last and final case (I can think of right now), the backing group being offline/malfunctioning: We already have redundancy. If this proved to be an issue in practice we can always increase the backing group size.

So let's ignore the malicious case for a bit. Assuming honest nodes/normal operation: We should strive for slow availability being exceptional, even if we had burst cores. As even with a perfectly fine working burst core, the para would still play catch up and would not get the intended block times.

If we found availability to be problematic, a potential consensus backing group system could be beneficial here as well: If we knew in advance what is going to be backed (more or less reliably) we can afford to also start availability asynchronously and earlier than we do now. We could even let the backing group commit ahead of time (via the hash) to the next thing they are going to back. If we bundle that up with the sent statement, then this would work without additional signatures.


A submitted candidate which uses `p: PartsOf57600` of the core's resources for a parachain P at a block B is accepted if:
* There is no candidate pending availability for B
* `effective_consumption(B, region) + p <= maximum` if `maximum` is `Some`
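
The two conditions quoted here can be sketched as a simple predicate. This is only an illustration: `is_accepted` is a hypothetical name, `effective` stands for `effective_consumption(B, region)` and is supplied by the caller, and any acceptance conditions not shown in this excerpt are not covered.

```rust
/// Checks the two listed conditions for a candidate that uses `p` parts of
/// the core's resources.
fn is_accepted(
    pending_availability: bool,
    effective: u64,
    p: u64,
    maximum: Option<u64>,
) -> bool {
    // Condition 1: no candidate may already be pending availability.
    if pending_availability {
        return false;
    }
    // Condition 2: only applies when the region has an explicit `maximum`.
    match maximum {
        // Treat arithmetic overflow as exceeding the maximum.
        Some(max) => effective.checked_add(p).map_or(false, |total| total <= max),
        None => true,
    }
}
```

For example, a region with `maximum = Some(57600)` that has effectively consumed 28800 parts accepts a candidate of 28800 parts but rejects one of 28801.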

Ok with this maximum becomes a bit more clear, in that it is the consumption up until that relay-chain block. Still not sure why this is needed on top of maximum_consumption. Uuuh! Is this meant as an alternative to specifying end? Ok, that makes sense now.

Contributor Author

This maximum is the total maximum of the region, in case it needs to be less than the implicit maximum given by end.

e.g. an on-demand purchase might have a high rate and a low maximum.

For on-demand, I am just realizing that every time I think of it, I map it differently to regions.
There is some design space, I can think of at least two options:

  • Regions of size 1, with rate 1 (57600/57600) and len 1: Should work with lenience beyond end.
  • Overlapping regions of size x, with rate 1/x.

The second option is more flexible and potentially more efficient, so let's go with it: Let's assume region size x or let's be even more concrete, let's make it 5: What values would you pick for rate and maximum? I would assume rate 1/5th and maximum? Same - answered in pseudo code below.

This maximum is the total maximum of the region, in case it needs to be less than the implicit maximum given by end.

How is this different from setting a low rate? Because usage of rate is limited by lenience, while using the maximum_consumption is not? So one could effectively increase lenience this way? I don't think we want that and also given that total core usage is restricted to 1, this still does not make sense (Higher than used rate would limit other overlapping regions). 🤔

I stand by my interpretation, that it is an alternative to specifying end directly. Due to (limited) lenience, setting either should be pretty much equivalent - especially if we allowed lenience crossing end. Might be an argument for not allowing that, then there are options:

  1. For use cases where timing is important, you set end and enforce it strictly.
  2. For use cases where a core is highly shared (you want to ensure fairness) and timing (of future regions) is not important (e.g. on-demand), you would set maximum_consumption instead of end.


How core resource limits are defined is left beyond the scope of this RFC - at the time of writing, all cores have the same resource limits, but this design allows cores to be specialized in their resource limits, with some cores allowing more data, some allowing more granularity or execution time, etc.

If all of these conditions are met, along with other validity conditions for backed candidates beyond the scope of this RFC, then the candidate is pending availability and the region's consumption value is incremented to `effective_consumption(B, region) + p`. If the candidate times out before becoming available, the consumption value is reduced by `p`.
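
The bookkeeping in this paragraph can be sketched as two state transitions. `RegionState`, `on_backed` and `on_timed_out` are hypothetical names used only for this illustration, and `effective` again stands for `effective_consumption(B, region)` as computed by the caller.

```rust
/// Hypothetical per-region accounting state.
struct RegionState {
    consumption: u64,
}

impl RegionState {
    /// The candidate was accepted and is now pending availability:
    /// charge `p` immediately on top of the effective consumption.
    fn on_backed(&mut self, effective: u64, p: u64) {
        self.consumption = effective + p;
    }

    /// The candidate timed out before becoming available: refund `p`.
    fn on_timed_out(&mut self, p: u64) {
        // Saturating, so a stray refund can never underflow.
        self.consumption = self.consumption.saturating_sub(p);
    }
}
```

Charging at backing time rather than at availability means that a later candidate for the same region is already constrained by the resources claimed by an earlier one that is still pending.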

Why not simply postpone incrementing until availability is reached?

Contributor Author

Again due to elastic scaling. If we have a new RFC that allows multiple pending availability candidates, the later candidates have to be restricted by the resource consumption of the prior candidates.

Contributor Author

The effect is the same, though. There are a few ways to look at it and we could leave this section general.


The scheduling lenience allows regions to fall behind their expected tickrate, limited to a small maximum level of debt. This prevents core debt from being accumulated indefinitely and spent when convenient. Smoothing system load over short time horizons is desirable, but over infinite time horizons becomes dangerous.
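
One plausible way to express this bounded debt is sketched below. The definition is an assumption made for illustration and is not quoted from the text in this excerpt: it treats a region as having consumed at least its maximum consumption minus `lenience` blocks' worth of rate, so anything older than that is forfeited.

```rust
/// Assumed definition of effective consumption with bounded debt.
/// `lenience` corresponds to SCHEDULING_LENIENCE (in relay-chain blocks).
fn effective_consumption(
    consumption: u64,
    maximum_consumption: u64,
    rate: u64,
    lenience: u64,
) -> u64 {
    // The most a region may lag behind is `lenience * rate` parts.
    let floor = maximum_consumption.saturating_sub(lenience.saturating_mul(rate));
    consumption.max(floor)
}
```

Under this assumption, a full-rate region (57600 parts per block) that idles for 100 blocks with a lenience of 16 can still spend only 16 blocks' worth of coretime (921600 parts), not all 100.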

This RFC introduces a new `BURST_CORES` parameter into the `HostConfiguration` which relay-chain governance uses to manage the parameters of the parachains protocol. BURST_CORES are inspired by technologies such as hyperthreading, to emulate multiple logical cores on a single physical core as resources permit. BURST_CORES allow parachains to make up for missed scheduling opportunities, which is important to effectively decouple parachain growth from backing on the relay chain.

I am still dubious whether this is really needed/good:

  1. If resources have not been used before, they are lost. We can not use unused CPU time or network resources of the past. This means, we either have to under-utilize all the time in order to be able to handle a burst core or we suffer from service degradation when used, slowing things down even more. Which would lead to burst cores getting used in the next block again, retaining the degradation indefinitely (or at least over a prolonged period).
  2. While this RFC limits the amount of catch up a parachain can play, even the tiniest bit can be abused and trigger (1). Parachains could on purpose always run late in order to make use of BURST_CORES which would be detrimental to the system.

Without BURST_CORES we still have the flexibility of backing another available candidate (of a different allowed task), if a para runs late, allowing it to get backed the next block instead. This is effectively using resources, without creating bursts.

For the case of validators deliberately censoring a parachain: We do have group rotations, but indeed if a para is only scheduled once per hour, this would even with the above lenience likely mean to completely miss the opportunity. But even in this case the para could fix this by itself by ordering an on-demand core for making up for the missed opportunity caused by a malicious backing group.

Granted it does not sound fair for a para to have to buy an additional core, but the backers are incentivized to back candidates, so they can be made to lose out on rewards. On top of that you need multiple colluding backers to pull this off, which reduces the likelihood of this happening, at least with sufficient decentralization. A censoring block producer on the other hand can only censor for one block.

So while also not perfect, paras do have means for handling censoring backing groups, while it would be harder to handle untrusted collators for abusing burst cores.

Contributor Author

If resources have not been used before, they are lost. We can not use unused CPU time or network resources of the past. This means, we either have to under-utilize all the time

Yes, slightly. Probably not substantially.

Parachains could on purpose always run late in order to make use of BURST_CORES which would be detrimental to the system.

Validators would be getting an easier job while they're running late. The average load is what's important.

Contributor Author

while it would be harder to handle untrusted collators for abusing burst cores.

They are limited by BURST_CORES_PER_CORE, as well

Contributor Author

One of the key goals of the RFC is to reduce friction between parachains that are scheduled on the same core. At the moment, there is no friction because each parachain has a single dedicated core.

Reducing or eliminating friction should be a key goal of the system, as the system's value is driven by the quantity of processes using it and the efficient allocation of resources across those processes.

Contributor Author

@rphmeier rphmeier Aug 4, 2023

One of the main reasons for this friction is that candidates pin to a specific relay-parent, and that the set of allowed relay parents expires relatively quickly.

But even in this case the para could fix this by itself by ordering an on-demand core for making up for the missed opportunity caused by a malicious backing group

This would be a terrible experience for users - it reduces the value proposition of acquiring coretime in the first place.

One of the main reasons for this friction is that candidates pin to a specific relay-parent, and that the set of allowed relay parents expires relatively quickly.

How is this fixed with regions? (answered myself below)

This would be a terrible experience for users - it reduces the value proposition of acquiring coretime in the first place.

Not too terrible (I think) and is already defense in depth. Validators are invested and incentivized to operate according to the protocol. Having an escape hatch (ordering an on-demand core automatically) in case all this does not help does not seem too bad to me. Ordering an on-demand core can be automated and should be frictionless; after all, this is our low-barrier entry point.

How is this fixed with regions?

Via burst cores: the candidate simply does not expire, because it can be backed regardless.

Reducing or eliminating friction should be a key goal of the system, as the system's value is driven by the quantity of processes using it and the efficient allocation of resources across those processes.

I think there is only friction if we have a strict end block number for a region. There was the argument that, for very low-frequency parachains, the consequences of missing out are potentially more severe. E.g. assume one block per hour: missing that one block might be a bigger problem than a six-second parachain missing a block. I don't think this assumption is necessarily true: a high-load parachain with 6s block times might suffer more than a really low-use 1-hour parachain. Anyhow, there is a second argument: if you only produce a block every other hour and the region lasts for 10 minutes, you might have ordered a region and not authored a single block, so you lost 100% of your "investment in core time", while a 6s parachain only lost 1% of its investment. But this does not add up either: the 1-hour parachain still has 100 opportunities to author a single block, while for the 6s parachain every missed opportunity would be an actual loss.

So in a nutshell, while I can understand the motivation to make up for missed opportunities due to slow availability, I don't get how this relates to friction of shared usage. If the core was not shared and availability took longer, that single parachain would also suffer from longer block times. Currently we even have retries on availability timeouts, so if that 1-hour parachain were affected by an availability timeout, it would be retried in a guaranteed fashion, regardless of any sharing.

This also brings me to another thing. I just re-read the RFC and the forum post, but could not find an answer. Assume we have a core shared by multiple regions: technically they are all equal, so the collators of all those processes could try to provide a collation at the same time, while we clearly cannot support this (at least not always). Consider a region of length 100, with 1/100 regions on them - that would not even work with burst cores. Validators can decide what to fetch, but at the very least the collators are wasting resources preparing something that is never fetched (probably repeatedly so, until all the other processes got served), and validators would also need to coordinate what collations to fetch (and validate) at any given time. I guess I just don't properly understand yet how this probabilistic scheduling actually works.

@rphmeier
Contributor Author

rphmeier commented Aug 4, 2023

For a given relay chain parent you have x cores for para y. We just accept x candidates, which core we use for which should not matter. Processes on a computer also don't care on which cores they are run. For forks and such, I don't see how regions directly solve this problem. If we wanted to be explicit for any reason, we could equally well put the core index in the receipt.

I hadn't made it completely clear in the RFC text. Of course this is the end result we want.

Hidden complexity; here be dragons: #3 (comment)

@rphmeier
Contributor Author

rphmeier commented Aug 4, 2023

If we wanted the scheduler to use regions even for on-demand, how would this look like?

First off, there's no distinction between an on-demand core and a non-on-demand core.

There are only workloads per core, defined in paritytech/polkadot#5 .

oversimplified pseudocode, but here's what I'm imagining, for instance:

enum CoreAssignment {
  Task(ParaId),
  InstantaneousPool,
}

struct OnDemandTracker {
  last_reset: BlockNumber,
  consumed: RationalOver57600,
  rate: PartOf57600,
}

impl OnDemandTracker {
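  fn has_space(&self, needed: RationalOver57600, now: BlockNumber) -> bool {
    // capacity accrued since the last reset; compared additively so the check cannot underflow.
    let accrued = now.saturating_sub(self.last_reset) * self.rate;
    accrued >= self.consumed + needed
  }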

  fn consume(&mut self, used: RationalOver57600) {
    self.consumed += used;
  }
}

storage OnDemand: StorageMap<CoreId, OnDemandTracker>;

// code implementing RFC-5
fn assign_core(core_id: CoreId, assignments: Vec<(CoreAssignment, PartOf57600)>) {
  // create regions directly for all `CoreAssignment::Task`.

  let instantaneous_rate: PartOf57600 = ...; // from `CoreAssignment::InstantaneousPool`
  if instantaneous_rate == 0 {
    OnDemand::remove(core_id);
  } else {
    OnDemand::insert(core_id, OnDemandTracker { last_reset: now, consumed: 0, rate: instantaneous_rate });
  }
}

fn schedule_on_demand(para_id: ParaId) {
  const WINDOW_SIZE: BlockNumber = 5;

  let needed_resources = 57600; // assume entire core is consumed for one relay-chain block in total.

  // find first core with enough available resources
  // for illustration only.
  // in practice, iteration is not efficient. a linked list of cores with enough resources could be used, 
  // with one iteration per block to bring cores back into the linked list.
  for (core_id, resources) in OnDemand::iter_mut() {
    if !resources.has_space(needed_resources, now) { continue }
    resources.consume(needed_resources);

    let start = now;
    let end = start + WINDOW_SIZE;

    // create a region that allows the parachain to make one block in the next 5.
    // this code could easily be adapted to allow for multiple on-demand blocks to be purchased at once.
    create_region(Region {
      core: core_id,
      schema: RegionSchema { start, end, maximum: Some(needed_resources), rate: needed_resources },
      consumption: 0,
      assignee: para_id,
    });
    break;
  }
}

// Updated when regions are created, transferred, or collected.
// This is required for runtime APIs or other higher-level logic on collators to iterate the regions assigned
// to a specific para.
storage double_map RegionsPerPara: (ParaId, RegionId) -> ();
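
For anyone who wants to poke at the accounting, here is a minimal runnable sketch of the tracker alone. It assumes plain `u64` values in place of `RationalOver57600` and `PartOf57600`, and the `new`, `available`, and `try_consume` helpers are hypothetical additions for illustration, not part of the pseudocode above.

```rust
/// Denominator for core-share fractions (RATE_DENOMINATOR in the RFC).
const RATE_DENOMINATOR: u64 = 57_600;

type BlockNumber = u64;

/// Tracks how much on-demand capacity a core has accrued and spent.
/// Assumption: parts of 57600 are represented as plain u64 values.
#[derive(Debug, Clone, PartialEq)]
struct OnDemandTracker {
    last_reset: BlockNumber,
    /// Parts of 57600 consumed since `last_reset`.
    consumed: u64,
    /// Parts of 57600 accrued per relay-chain block.
    rate: u64,
}

impl OnDemandTracker {
    /// Hypothetical constructor: starts with nothing consumed.
    fn new(now: BlockNumber, rate: u64) -> Self {
        Self { last_reset: now, consumed: 0, rate }
    }

    /// Capacity accrued since the last reset, minus what was already consumed.
    /// Saturating arithmetic keeps the result at zero instead of underflowing.
    fn available(&self, now: BlockNumber) -> u64 {
        let elapsed = now.saturating_sub(self.last_reset);
        elapsed.saturating_mul(self.rate).saturating_sub(self.consumed)
    }

    fn has_space(&self, needed: u64, now: BlockNumber) -> bool {
        self.available(now) >= needed
    }

    /// Hypothetical helper: consumes capacity only if enough has accrued.
    fn try_consume(&mut self, needed: u64, now: BlockNumber) -> bool {
        if !self.has_space(needed, now) {
            return false;
        }
        self.consumed += needed;
        true
    }
}
```

With a rate of half a core per block, a tracker created at block 10 first has room for one full-core on-demand block at block 12, and for the next one at block 14.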

Worth mentioning: as Gavin already pointed out, if we used region ids as defined in RFC-1, collators would have all the information they need with just a single lookup. On the other hand, having non-opaque RegionIds does not seem very future-proof. E.g. there is no versioning in RFC-1 RegionIds, so it is going to be troublesome to update the format. Also, if we ever found that we need more information, or information in a different format, the key size might need to grow or we would lose the self-describing property.

Given that this RFC proposes 256-bit ids while RFC-1 proposes 128-bit ids, those concerns might be moot though, as we have the leeway to add a version, for example, without requiring more space than the ids here.


The scheduling lenience allows regions to fall behind their expected tickrate, limited to a small maximum level of debt. This prevents core debt from being accumulated indefinitely and spent when convenient. Smoothing system load over short time horizons is desirable, but over infinite time horizons becomes dangerous.

This RFC introduces a new `BURST_CORES` parameter into the `HostConfiguration` which relay-chain governance uses to manage the parameters of the parachains protocol. BURST_CORES are inspired by technologies such as hyperthreading, to emulate multiple logical cores on a single physical core as resources permit. BURST_CORES allow parachains to make up for missed scheduling opportunities, which is important to effectively decouple parachain growth from backing on the relay chain.

### Changes to backing/availability

Regions are used to modify the behavior of the parachain backing/availability pipeline. The first major change is that parachain candidates submitted to the relay-chain in the `ParasInherent` will be annotated with the `RegionId` that they are intended to occupy. Validators and collators do the work of figuring out which blocks are assigned to which regions, lifting the burden of granular scheduling off of the relay chain.

I did not see the immediate value for having multiple overlapping regions on the same core, as those could be fused into a single region. On second thought, assuming regions have a significant lifespan, it might of course make sense to use a second half of a core for a fraction of the time for example, which would be modeled by having two overlapping regions on the same core with the same ParaId.

So I agree for maximum flexibility this seems useful! Although, I don't think allocation efficiency would suffer too much if we forbid this. (Not saying we should, but I think not allowing this, could be an acceptable point in the design space, if it provided other benefits.)


A submitted candidate which uses `p: PartsOf57600` of the core's resources for a parachain P at a block B is accepted if:
* There is no candidate pending availability for B
* `effective_consumption(B, region) + p <= maximum` if `maximum` is `Some`
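
A rough sketch of how the two conditions quoted above could be checked, for illustration only. The body of `minimum_consumption` is an assumption (a region is treated as having used its rate for every elapsed block older than `SCHEDULING_LENIENCE`); it is not taken from the RFC text, and the RFC may list further acceptance conditions that are not modelled here. The `Region` struct is a simplified, hypothetical stand-in.

```rust
const RATE_DENOMINATOR: u64 = 57_600;
/// Value taken from the RFC's parameter table.
const SCHEDULING_LENIENCE: u64 = 16;

/// Simplified, hypothetical stand-in for a region.
struct Region {
    start: u64,
    end: u64,
    /// Parts of 57600 the region may consume per relay-chain block.
    rate: u64,
    /// Optional cap on total consumption.
    maximum: Option<u64>,
    /// Parts of 57600 consumed so far.
    consumption: u64,
}

/// ASSUMPTION: the floor on consumption grows at `rate` per block once the
/// region is more than `SCHEDULING_LENIENCE` blocks old, and stops at `end`.
fn minimum_consumption(now: u64, r: &Region) -> u64 {
    let elapsed = now.min(r.end).saturating_sub(r.start);
    elapsed.saturating_sub(SCHEDULING_LENIENCE).saturating_mul(r.rate)
}

/// `max(region.consumption, minimum_consumption(now, region))`
fn effective_consumption(now: u64, r: &Region) -> u64 {
    r.consumption.max(minimum_consumption(now, r))
}

/// Checks only the two conditions quoted above: nothing pending availability,
/// and the candidate's share `p` fits under `maximum` when one is set.
fn accepts(now: u64, r: &Region, p: u64, pending_availability: bool) -> bool {
    if pending_availability {
        return false;
    }
    match r.maximum {
        Some(max) => effective_consumption(now, r) + p <= max,
        None => true,
    }
}
```

Under this assumption, unused capacity older than the lenience window counts as spent, which is what keeps a region from banking debt indefinitely.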

For on-demand, I am just realizing that every time I think of it, I map it differently to regions.
There is some design space, I can think of at least two options:

  • Regions of size 1, with rate 1 (57600/57600) and len 1: Should work with lenience beyond end.
  • Overlapping regions of size x, with rate 1/x.

The second option is more flexible and potentially more efficient, so let's go with it. Assume region size x or, to be more concrete, 5: what values would you pick for rate and maximum? I would assume a rate of 1/5, and the same for maximum? (Answered in the pseudocode below.)

This maximum is the total maximum of the region, in case it needs to be less than the implicit maximum given by end.

How is this different from setting a low rate? Because usage of rate is limited by lenience, while using the maximum_consumption is not? So one could effectively increase lenience this way? I don't think we want that and also given that total core usage is restricted to 1, this still does not make sense (Higher than used rate would limit other overlapping regions). 🤔

I stand by my interpretation that it is an alternative to specifying end directly. Due to (limited) lenience, setting either should be pretty much equivalent, especially if we allowed lenience to cross end. That might be an argument for not allowing it; then there are two options:

  1. For use cases where timing is important, you set end and enforce it strictly.
  2. For use cases, where a core is highly shared (you want to ensure fairness) and timing (of future regions) is not important ( e.g. on-demand), you would set maximum_consumption instead of end.

`max(region.consumption, minimum_consumption(now, region))`

A submitted candidate which uses `p: PartsOf57600` of the core's resources for a parachain P at a block B is accepted if:
* There is no candidate pending availability for B

Burst cores are supposed to solve the issue that collators should not need to predict whether a core is going to be free or not. OK, agreed, this reduces wasted effort in that case, at the expense of higher burstiness in approval voting and availability. We are already seeing burstiness cause issues, so increasing burstiness still sounds troublesome to me.

But, I am understanding the motivation better now. Right now for handling burstiness we increase channel sizes, but those are already quite ridiculously large. Increasing their size further increases memory pressure and latency. Both can cause other issues.

| Name | Constant | Value |
| ------------------- | -------- | --------- |
| RATE_DENOMINATOR | YES | 57600 |
| SCHEDULING_LENIENCE | NO | 16 |

I think high load is actually the most likely reason for slow availability even longer term, because backers are incentivized to make the candidate available. If they fail to do so, they lose out on rewards while already having invested all the work.

If a backing group wanted to censor some para, it would most likely do so even before availability, but they are still incentivized to back something. So they would very likely back some other eligible para, but then that para got served and its effective usage increased, therefore a future core just got free for that censored para.

In the last and final case (I can think of right now), the backing group being offline/malfunctioning: We already have redundancy. If this proved to be an issue in practice we can always increase the backing group size.

So let's ignore the malicious case for a bit. Assuming honest nodes/normal operation: We should strive for slow availability being exceptional, even if we had burst cores. As even with a perfectly fine working burst core, the para would still play catch up and would not get the intended block times.

If we found availability to be problematic, a potential consensus backing group system could be beneficial here as well: If we knew in advance what is going to be backed (more or less reliably) we can afford to also start availability asynchronously and earlier than we do now. We could even let the backing group commit ahead of time (via the hash) to the next thing they are going to back. If we bundle that up with the sent statement, then this would work without additional signatures.

- To enable world-scale, highly agile coretime, as much scheduling overhead as possible should be pushed to validators' native implementations, or even better collators'.
2. **The solution MUST minimize the complexity of determining a mapping between upcoming blocks of a parachain when the parachain is scheduled on many Execution Cores simultaneously.**
- Without this, the validators assigned to those execution cores will have no way of determining which upcoming blocks they are responsible for, and will waste resources. This is a practical requirement for e.g. elastic scaling an application to 2, 8, or 16 simultaneous cores.
3. **The solution SHOULD minimize the time for applications to make up for missed scheduling opportunities even when cores are highly shared**.

Why is order important? How does it anchor to a key goal of the system? What is the edge of an order-based architecture over other architectures?

  1. You have been concerned about MEV (based on ordering) and named it as one reason why a para might miss an opportunity. Having some pre-defined ordering could help make this harder.
  2. It allows bulk region buyers to be more explicit in how they want to share their core. E.g. [A, A, B, B], vs. [A, B, A, B], vs. ...
  3. As mentioned below (algorithm for determining whether a region is in surplus), I am not sure at the moment how things work at all if we don't have some definition of order.

- To enable world-scale, highly agile coretime, as much scheduling overhead as possible should be pushed to validators' native implementations, or even better collators'.
2. **The solution MUST minimize the complexity of determining a mapping between upcoming blocks of a parachain when the parachain is scheduled on many Execution Cores simultaneously.**
- Without this, the validators assigned to those execution cores will have no way of determining which upcoming blocks they are responsible for, and will waste resources. This is a practical requirement for e.g. elastic scaling an application to 2, 8, or 16 simultaneous cores.
3. **The solution SHOULD minimize the time for applications to make up for missed scheduling opportunities even when cores are highly shared**.

And with elastic scaling, the negative effects of missing an opportunity grow significantly.

Because if only one candidate is missing availability, all the candidates in the next relay chain block are blocked? Or is there more?

In any case, I agree this sucks. Bundling them all up might help a bit, but does not solve the real issue you are highlighting here, which is: We have a very high profile para, so high profile that it is actually using multiple cores per relay chain block. Or in other words: "This is a premium customer! A premium customer with very high QoS requirements!"

A premium customer? Let's charge him premium! I am hereby proposing an alternative to burst cores: Premium Cores! Instead of blindly allowing any para to take a burst core (even those which don't actually care, but would potentially then affect service to others!), we split cores into cores with different QoS standards. Let's make them two for now: premium and standard. We then make it so that a premium core can consume existing standard cores if it missed an opportunity! Instead of oversubscribing or always under utilizing, premium cores just take precedence and are allowed to consume resources of standard cores (replacing their original assignment) if need be.

A three tier approach might be even more sensible:

  • Premium: Makes up for missed opportunities by displacing low-cost cores.
  • Best-effort: Does not make up for missed opportunities, but is never displaced.
  • Low-cost: Does not make up for missed opportunities and can be displaced.

Benefits:

  • Users can directly state their intent and requirements, and we can cater to them. There is no one-size-fits-all.
  • We cater both to entry level (low cost) and enterprise with high QoS requirements.
  • Covering for a missed opportunity does not cause more work and no bursts to the network.

Our value proposition to cater to the full spectrum, from tinkerers up to heavy duty, high load enterprise just became better. We could even prioritize premium cores in availability to reduce the risk of a missed opportunity in the first place.
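
To make the displacement rule concrete, here is a small sketch of the three tiers as described above. The type and function names (`CoreTier`, `pick_displacement`) are hypothetical, and picking the first displaceable core is an arbitrary choice made for illustration, not a proposed policy.

```rust
/// The three QoS tiers from the list above (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CoreTier {
    Premium,
    BestEffort,
    LowCost,
}

impl CoreTier {
    /// Only premium cores make up for missed opportunities.
    fn makes_up_for_missed(self) -> bool {
        matches!(self, CoreTier::Premium)
    }

    /// Only low-cost cores can be displaced.
    fn can_be_displaced(self) -> bool {
        matches!(self, CoreTier::LowCost)
    }
}

/// Given the tiers of the cores scheduled for the next block, returns the
/// index of a core that a para which missed an opportunity may take over.
/// Assumption: the first displaceable core is chosen.
fn pick_displacement(missed: CoreTier, scheduled: &[CoreTier]) -> Option<usize> {
    if !missed.makes_up_for_missed() {
        return None;
    }
    scheduled.iter().position(|tier| tier.can_be_displaced())
}
```

A premium para that missed its slot displaces a low-cost core if one is scheduled; best-effort and low-cost paras never displace anyone, and best-effort cores are never displaced.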

1. **The solution MUST gracefully handle tens of thousands of parachains without significant runtime scheduling overhead.**
- To enable world-scale, highly agile coretime, as much scheduling overhead as possible should be pushed to validators' native implementations, or even better collators'.
2. **The solution MUST minimize the complexity of determining a mapping between upcoming blocks of a parachain when the parachain is scheduled on many Execution Cores simultaneously.**
- Without this, the validators assigned to those execution cores will have no way of determining which upcoming blocks they are responsible for, and will waste resources. This is a practical requirement for e.g. elastic scaling an application to 2, 8, or 16 simultaneous cores.

If some backing group b is offline, e.g. [1, (), 3, 4], then 3 and 4 can step in for candidates B and C, respectively.

How would they learn about the offline backing group and how would they coordinate on who is going to work on what? All solvable, I guess, but also overcomplicating. Backing groups are already a redundant setup: if we experienced issues with whole backing groups being malicious or offline, we could always counteract by increasing the backing group size and thus increasing redundancy. With time disputes we would even be able to reduce the threshold to 1, meaning only one out of 5 needs to be live and honest for the backing group to be live and honest. I think this is already plenty of redundancy.

Some cores are occupied. Even though all validators got the candidates they were supposed to, the whole chain is held up by some other parachain.

I still think that it does not matter whether this is some other parachain or the same parachain, the issue is fundamentally the same: You miss out on an opportunity to author a block. Availability does not really care which parachain a candidate belongs to (I argued below that this is likely even true with a malicious backing group). Also to be clear, this is an argument for the burst part - correct?

They come to a mapping of [a: 2, b: 3, c: 1, d: 4] and each validate their respective block.
Unfortunately, the cores don't get freed up as predicted and this mapping turned out to be totally inefficient. What should they do? Stick with the existing mapping they calculated? Or try to calculate a new one?

Also fixed by burst - right?

No, it requires collator authentication and giving untrusted collators the ability to instruct validators to do anything. Otherwise any node could just send the candidate to all the backing groups and waste their resources.

The parachain consensus itself should be able to hint which core a particular candidate is intended for. After all, it had to be aware of all the cores assigned to the parachain when doing authoring. It should be unambiguous regardless of how backers receive potential candidates.

💡 Of course you are totally right. OK, this makes perfect sense now, thanks! If the core mapping is the result of the parachain consensus, this indeed helps, but it is actually still problematic: we only know whether the reported core is correct after doing the validation. Therefore a malicious actor could still severely mess with us and waste backers' resources. We would at least not send out statements, making it less likely for other validators in the same group to pick up the same wasted effort. Also, the backers will immediately know that they have been messed with and can reduce that peer's reputation, for example. So yes, agreed, we need that (or something like it).

The parachain consensus itself should be able to hint which core a particular candidate is intended for. After all, it had to be aware of all the cores assigned to the parachain when doing authoring. It should be unambiguous regardless of how backers receive potential candidates.

Now I am finally getting where this goal is coming from. 🥇

So I think I am getting now where the burst requirement is coming from, and I understand why we would like to have a core id or a region id in the candidate receipt. I am proposing an alternative to burst cores below, which I think better matches our actual intention and avoids the potential drawbacks. Where I am not there yet is whether we need a region id or whether a core id is enough. In the end it is a breaking change either way. You are arguing that a region id would be better, as we can then have a direct mapping ParaId -> RegionIds, while with cores we could only have (Relay parent hash, ParaId) -> CoreIds. I would argue that for regions it should also be (relay parent hash, ParaId) -> RegionId, as new regions might come into existence each block. For the claim queue approach the returned CoreIds would be all the cores where we have an upcoming assignment. We can, if we want to, actually completely ignore the ordering in the claim queue, because even if all paras immediately provided their collations, all of them could be backed (assuming timely availability). This is because the claim queue is no longer than max depth in async backing. This is something I am missing in regions: They are not limited in size and have no ordering. Not sure how this is supposed to work?
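
To illustrate the `(relay parent hash, ParaId) -> CoreIds` shape of lookup discussed above, here is a minimal sketch. `ClaimView`, `schedule`, and `cores_for` are hypothetical names, and the type aliases are stand-ins for the real relay-chain types; nothing here reflects an existing runtime API.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real relay-chain types.
type RelayParent = [u8; 32];
type ParaId = u32;
type CoreId = u32;

/// Claim-queue style view: the cores with an upcoming assignment for a para,
/// as seen from a particular relay parent.
#[derive(Default)]
struct ClaimView {
    upcoming: HashMap<(RelayParent, ParaId), Vec<CoreId>>,
}

impl ClaimView {
    /// Records that `para` has an upcoming assignment on `core`.
    fn schedule(&mut self, relay_parent: RelayParent, para: ParaId, core: CoreId) {
        self.upcoming.entry((relay_parent, para)).or_default().push(core);
    }

    /// Single lookup a collator would need; empty if nothing is scheduled.
    fn cores_for(&self, relay_parent: RelayParent, para: ParaId) -> &[CoreId] {
        self.upcoming
            .get(&(relay_parent, para))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Keying by relay parent means the answer can differ from block to block, which is the point made above about new assignments (or regions) coming into existence each block.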


Coretime regions are a dynamic, multi-purpose mechanism for determining the assignment of parachains to a relay chain's Execution Cores. They replace existing scheduling logic in the Polkadot Relay Chain, and introduce a notion of BURST_CORES for accepting parachain blocks at flexible points in time.

Each region is a data structure assigned to a particular application, which indicates future rights to consume coretime and keeps records of how many resources have been consumed so far. Relay-chain coretime regions, unlike those in RFC-0001 are not intended to be traded or manipulated by users directly, but instead allocated by higher-level mechanisms. The details of those higher-level mechanisms are out of scope for this RFC. Parachains can own an arbitrary number of regions and are limited in block production only by the number and production rate of regions which they own. Each region belongs to a specific execution core, and cannot move between cores. Regions are referenceable by a unique 256-bit identifier within the relay-chain.
@eskimor eskimor Aug 8, 2023

Still, using different definitions of regions seems off. That might be fine, if one is better suited for markets and one is better for scheduling, but so far, not even the ids would match. Maybe a reason to not yet vote on RFC-1?

@rphmeier
Contributor Author

rphmeier commented Aug 8, 2023

GitHub is broken so I can't respond in-line, but.

Still, using different definitions of regions seems off. That might be fine, if one is better suited for markets and one is better for scheduling, but so far, not even the ids would match. Maybe a reason to not yet vote on RFC-1?

They serve different purposes. One is for markets, the other is for off-chain scheduling which is quickly verifiable on-chain. RFC-1 wasn't designed with collator/validator interaction in mind so there are necessarily different goals and designs for those.

How would they learn about the offline backing group and how would they coordinate on who is going to work on what? All solvable, I guess, but also over complicating

That was the point of the Socratic dialogue. That hidden complexity lies in the assumption that "collators and validators are smart". I'm not saying this is easy, but backers detecting that other groups aren't working is strictly easier. Though this is all probably moot given https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629?u=rphmeier as we don't truly need to introduce more backing groups in order to schedule more cores.

Burst cores are also an artifact that is worked around in the above proposal. It's true that we need to limit load, but asking chains to order extra on-demand cores just to use coretime they already paid for and avoid the "roadblocks" of occupied cores feels like such a non-starter that we should not even consider it. If the system is overloaded and coretime was oversold, they should ideally end up getting refunds, not paying more. Would you use an airline which overbooked seats, refused you the right to board, and then offered to let you on if you paid extra money? This is much less of an issue when we remove core affinity, so I will attempt to rework this proposal without any core affinity whatsoever.

They are not limited in size and have no ordering. Not sure how this is supposed to work?

Why is ordering important? Only because of friction between parachains scheduled on the same core. And even so, it is a very incomplete solution. This proposal removes that friction by adding burst cores, so ordering is unimportant. If core affinity is removed altogether, ordering becomes even less important: there is still some friction, but its effect on particular parachains averages out system-wide.

I think high load is actually the most likely reason for slow availability even longer term, because backers are incentivized to make the candidate available. If they fail to do so, they lose out on rewards while already having invested all the work.

Practically speaking, with our approvals protocol the amount of work which any particular validator must do is only equal in expectation. Sometimes validators may be selected to do much more work, and sometimes much less. The distribution of workload in approvals for validators is Gaussian. Therefore, validators already typically run below "capacity", as they may end up drawing a heavy workload from the far right tail of the distribution and this must not cause the network to fall over. The amount of system-wide burst cores certainly moves this distribution over somewhat, but the case that is actually worth worrying about is:

  • when all burst cores are utilized and
  • some validators draw 3-sigma or higher levels of workload

Burst cores definitionally cannot be utilized more than 50% of the time (as they are only used to cancel out missed opportunities). It's not clear that introducing a small number of burst cores actually leads to anything other than occasional slight finality delays on a few blocks, where a significant number of their tranche-0 checkers have all drawn 3-sigma workloads and then must be covered by tranche-1 validators. This increases network load slightly, and certainly linearly in the number of cores, but with a very small constant factor due to the probabilities involved.
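
For a sense of scale, taking the Gaussian model above at face value, the upper tail beyond three standard deviations is small. The second expression additionally assumes that the workloads of the k tranche-0 checkers are drawn independently, which is a simplification made here for illustration and not a claim about the approvals protocol.

```latex
P(X \ge \mu + 3\sigma) = 1 - \Phi(3) \approx 1.35 \times 10^{-3},
\qquad
P(\text{all } k \text{ checkers} \ge \mu + 3\sigma) \approx \left(1.35 \times 10^{-3}\right)^{k}
```

Here \Phi is the standard normal CDF, and \mu and \sigma are the mean and standard deviation of a validator's approval workload.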

@rphmeier
Contributor Author

I am closing this RFC in anticipation of a simpler & more powerful replacement.
