Replies: 2 comments 2 replies
-
I kinda agree. The caches are not useful for internal IMEM/DMEM, i.e. while those are enabled. Once they are disabled, the caches make sense though. In my SMP experiment I actively use such an arrangement: multiple harts, each with its own I$ and D$, all accessing a single shared SDRAM through the external memory interface; internal IMEM and DMEM are disabled. My two cents: why not do both? Keep the existing caches as a sort of Level 1 cache (maybe rename them?), but also introduce caches at the places you mentioned as a sort of Level 2 cache. We could also choose to implement write-back instead of write-through - this could fix the latency issue with internal IMEM/DMEM?
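The write-back vs. write-through trade-off can be illustrated with a tiny model (a sketch under assumed behavior, not the NEORV32 implementation): a write-through cache puts every store on the memory bus, while a write-back cache with write-allocate only touches the bus when a dirty line gets evicted.

```python
# Toy direct-mapped cache model counting memory-bus writes per store policy.
# All parameters and behavior are illustrative assumptions.

class CacheLine:
    def __init__(self):
        self.tag = None
        self.dirty = False

class TinyCache:
    def __init__(self, num_lines, write_back):
        self.lines = [CacheLine() for _ in range(num_lines)]
        self.write_back = write_back
        self.mem_writes = 0  # bus write transactions to main memory

    def store(self, addr):
        line = self.lines[addr % len(self.lines)]
        if self.write_back:
            if line.tag not in (None, addr) and line.dirty:
                self.mem_writes += 1       # evicting a dirty line -> one bus write
            line.tag, line.dirty = addr, True  # write-allocate, mark line dirty
        else:
            self.mem_writes += 1           # write-through: every store hits the bus

# Eight stores to the same address:
wt = TinyCache(4, write_back=False)
wb = TinyCache(4, write_back=True)
for _ in range(8):
    wt.store(0x100)
    wb.store(0x100)
```

Here `wt.mem_writes` ends up at 8 while `wb.mem_writes` stays 0 - the write-back line stays dirty in the cache and is never evicted, which is exactly the latency the write-through d-cache pays on every store.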
-
I have been working on a more generic cache module that (hopefully) provides better efficiency by implementing "write-back" and "write-allocate" strategies instead of the current d-cache's "write-through" strategy.

Right now, the cache is still direct-mapped. A 2-way set-associative cache might be much better - especially when using external memory for both data and instructions. I tried to address this by implementing a "virtual splitting" of the new cache (VSPLIT). With this feature enabled, the lower half of the cache blocks is reserved for data only and the upper half is reserved for instructions only. For some workloads (like code that uses the …).

The cache is still in an early stage and needs some more optimization, but synthesis and simulation already look fine: https://github.com/stnolting/neorv32/blob/generic_cache/rtl/core/neorv32_cache.vhd

I made some simple benchmarks / tests:
Interestingly, the setup with the "external cache" (having the same total size as the old i-cache + d-cache) is about 10% faster (for a very synthetic workload). It seems that the bus congestion caused by the CPU's instruction and data interfaces is not a big performance issue. Anyway, more tests are required. But I wonder if we should simply remove the CPU caches and just stick with the external bus cache? 🤔

Btw, the new cache module is truly generic - so it could also be used for something like a multi-core setup ... just saying 😉
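The VSPLIT idea described above could be sketched roughly like this (block count, block size, and the exact indexing scheme are assumptions for illustration, not the actual neorv32_cache.vhd parameters):

```python
# Hypothetical sketch of "virtual splitting" (VSPLIT): a direct-mapped cache
# whose block index is folded into the lower half for data accesses and the
# upper half for instruction fetches, so the two streams cannot evict each other.

NUM_BLOCKS = 8          # assumed; must be a power of two
BLOCK_SIZE = 16         # assumed bytes per block

def block_index(addr, is_instruction, vsplit=True):
    idx = (addr // BLOCK_SIZE) % NUM_BLOCKS
    if vsplit:
        half = NUM_BLOCKS // 2
        idx = idx % half               # fold the index into half the blocks
        if is_instruction:
            idx += half                # instructions use the upper half
    return idx

# With VSPLIT, data and instructions at the same address map to different blocks:
assert block_index(0x40, is_instruction=False) != block_index(0x40, is_instruction=True)
# Without VSPLIT, they collide and would thrash the same block:
assert block_index(0x40, False, vsplit=False) == block_index(0x40, True, vsplit=False)
```

The cost is that each stream effectively only sees half the cache, which is why this mainly pays off when instruction/data conflict misses dominate.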
-
In general, a cache can improve overall performance by hiding the latency of slow memories, utilizing temporal and spatial data locality. Right now we have two optional caches right next to the CPU: a read-only cache for instructions and a read/write cache for data.
The default processor-internal / FPGA-internal memories have an access latency of 1 cycle, so the CPU can access them as fast as possible. If the caches are enabled, the overall performance drops significantly, as these maximum-speed memories are now cached, which requires additional time. Even worse, the data cache uses a write-through architecture, so any store operation bypasses the cache with an extra delay of one cycle.
A setup that only uses internal IMEM/DMEM, with the caches enabled, has about 40% less CoreMark performance than the same system without caches.
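A back-of-the-envelope model of average access time makes this concrete (all cycle counts and the hit rate below are illustrative assumptions, not measurements):

```python
# Average memory access time: caching a 1-cycle internal memory can only add
# cycles, while caching a slow external memory pays off. Numbers are assumed.

def avg_access_cycles(hit_rate, hit_cycles, miss_penalty):
    """Average cycles per access for a cache with the given hit rate."""
    return hit_rate * hit_cycles + (1 - hit_rate) * (hit_cycles + miss_penalty)

internal_direct   = 1.0                              # IMEM/DMEM, no cache
internal_cached   = avg_access_cycles(1.0, 2, 1)     # lookup adds a cycle even at 100% hits
external_uncached = 10.0                             # e.g. SDRAM via the external bus
external_cached   = avg_access_cycles(0.95, 2, 10)   # assumed 95% hit rate

assert internal_cached > internal_direct    # caching fast internal memory hurts
assert external_cached < external_uncached  # caching slow external memory helps
```

Even with a perfect hit rate the cached internal path costs 2 cycles versus 1 uncached, while the cached external path drops from 10 cycles to 2.5 under these assumptions - which is the asymmetry driving the whole question.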
Right now there are just two modules with more than 1 cycle of access latency: the execute-in-place module (XIP) and the external memory interface (WISHBONE). Accesses to these modules can really benefit from having the caches.
Now the question: do we really need the caches? And do we need them where they are right now?
I've been thinking about relocating the caches. The read-only i-cache could be moved inside the XIP module, while the read/write d-cache could be moved to the external bus interface. That way, we would only cache accesses that are guaranteed to have a very large latency compared to the other internal modules.
Of course, the caches are quite handy where they are right now: they reduce bus traffic, as the CPU's instruction and data ports have to share a single processor-wide bus. But actually, this is not a big deal, as the CPU is equipped with an instruction prefetch buffer that speculatively fetches instructions whenever the CPU does not perform a load/store operation. So I think it won't hurt (too much) not to have the caches right in front of the CPU.
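The effect of the prefetch buffer on the shared bus can be sketched with a toy model (the fetch/consume rates and the arbitration below are simplifying assumptions, not the actual NEORV32 bus behavior):

```python
# Toy shared-bus model: instruction prefetch only runs in cycles the data port
# leaves free; the CPU needs a new instruction every other cycle (assumed rate)
# and stalls when the prefetch buffer is empty.

def stall_cycles(total_cycles, data_every_n, buffer_depth):
    buffered = 0
    stalls = 0
    for cycle in range(total_cycles):
        bus_free = (cycle % data_every_n != 0)   # a load/store owns the bus otherwise
        if bus_free and buffered < buffer_depth:
            buffered += 1                        # speculative instruction fetch
        if cycle % 2 == 0:                       # CPU consumes an instruction
            if buffered:
                buffered -= 1
            else:
                stalls += 1                      # empty buffer: CPU has to wait
    return stalls

# A small prefetch buffer absorbs a data access every 4th cycle almost entirely,
# while with no buffer every instruction consumption stalls:
assert stall_cycles(100, 4, 4) < stall_cycles(100, 4, 0)
```

Under these assumptions the buffered CPU stalls only once in 100 cycles (the cold start), which is the intuition behind "it won't hurt too much" to drop the CPU-side caches.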
Anyway, I'm curious what you think. 😉