Replies: 2 comments 2 replies
-
I kinda agree. The caches are not useful for internal IMEM/DMEM, i.e. while those are enabled. Once they are disabled, the caches make sense though. In my SMP experiment I actively use such an arrangement: multiple harts, each with its own I$ and D$, all accessing a single shared SDRAM through the external memory interface; internal IMEM and DMEM are disabled. My two cents: why not do both? Keep the existing caches as a sort of Level 1 cache (maybe rename them?), but also introduce caches at the places you mentioned as a sort of Level 2 cache. We could also choose to implement write-back instead of write-through - this could fix the latency issue with internal IMEM/DMEM?
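The write-back vs. write-through trade-off can be illustrated with a tiny model (a sketch under assumed behavior, not the NEORV32 implementation): a write-through cache puts every store on the memory bus, while a write-back cache with write-allocate only touches the bus when a dirty line gets evicted.

```python
# Toy direct-mapped cache model counting memory-bus writes per store policy.
# All parameters and behavior are illustrative assumptions.

class CacheLine:
    def __init__(self):
        self.tag = None
        self.dirty = False

class TinyCache:
    def __init__(self, num_lines, write_back):
        self.lines = [CacheLine() for _ in range(num_lines)]
        self.write_back = write_back
        self.mem_writes = 0  # bus write transactions to main memory

    def store(self, addr):
        line = self.lines[addr % len(self.lines)]
        if self.write_back:
            if line.tag not in (None, addr) and line.dirty:
                self.mem_writes += 1       # evicting a dirty line -> one bus write
            line.tag, line.dirty = addr, True  # write-allocate, mark line dirty
        else:
            self.mem_writes += 1           # write-through: every store hits the bus

# Eight stores to the same address:
wt = TinyCache(4, write_back=False)
wb = TinyCache(4, write_back=True)
for _ in range(8):
    wt.store(0x100)
    wb.store(0x100)
```

Here `wt.mem_writes` ends up at 8 while `wb.mem_writes` stays 0 - the write-back line stays dirty in the cache and is never evicted, which is exactly the latency the write-through d-cache pays on every store.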
-
I have been working on a more generic cache module that (hopefully) provides better efficiency by implementing "write-back" and "write-allocate" strategies instead of the current d-cache's "write-through" strategy.

Right now, the cache is still direct-mapped. A 2-way set-associative cache might be much better - especially when using external memory for both data and instructions. I tried to address this by implementing a "virtual splitting" of the new cache (VSPLIT). With this feature enabled, the lower half of the cache blocks is reserved for data only and the upper half is reserved for instructions only. For some workloads (like code that uses the …).

The cache is still in an early stage and needs some more optimization, but synthesis and simulation already look fine: https://github.com/stnolting/neorv32/blob/generic_cache/rtl/core/neorv32_cache.vhd

I made some simple benchmarks / tests:
Interestingly, the setup with the "external cache" (having the same total size as the old i-cache + d-cache) is about 10% faster (for a very synthetic workload). It seems that the bus congestion caused by the CPU's instruction and data interfaces is not a big performance issue. Anyway, more tests are required. But I wonder if we should simply remove the CPU caches and just stick with the external bus cache? 🤔

Btw, the new cache module is truly generic - so it could also be used for something like a multi-core setup ... just saying 😉
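The VSPLIT idea described above could be sketched roughly like this (block count, block size, and the exact indexing scheme are assumptions for illustration, not the actual neorv32_cache.vhd parameters):

```python
# Hypothetical sketch of "virtual splitting" (VSPLIT): a direct-mapped cache
# whose block index is folded into the lower half for data accesses and the
# upper half for instruction fetches, so the two streams cannot evict each other.

NUM_BLOCKS = 8          # assumed; must be a power of two
BLOCK_SIZE = 16         # assumed bytes per block

def block_index(addr, is_instruction, vsplit=True):
    idx = (addr // BLOCK_SIZE) % NUM_BLOCKS
    if vsplit:
        half = NUM_BLOCKS // 2
        idx = idx % half               # fold the index into half the blocks
        if is_instruction:
            idx += half                # instructions use the upper half
    return idx

# With VSPLIT, data and instructions at the same address map to different blocks:
assert block_index(0x40, is_instruction=False) != block_index(0x40, is_instruction=True)
# Without VSPLIT, they collide and would thrash the same block:
assert block_index(0x40, False, vsplit=False) == block_index(0x40, True, vsplit=False)
```

The cost is that each stream effectively only sees half the cache, which is why this mainly pays off when instruction/data conflict misses dominate.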
-
In general, a cache can improve overall performance by hiding the latency of slow memories, utilizing temporal and spatial data locality. Right now we have two optional caches right next to the CPU: a read-only cache for instructions and a read/write cache for data.
The default processor-internal / FPGA-internal memories have an access latency of 1 cycle, so the CPU can access them as fast as possible. If the caches are enabled, the overall performance drops significantly, as these maximum-speed memories are now cached, which requires additional time. Even worse, the data cache uses a write-through architecture, so any store operation bypasses the cache with an extra delay of one cycle.
A setup that only uses internal IMEM/DMEM, with the caches enabled, has about 40% less CoreMark performance than the same system without caches.
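A back-of-the-envelope model of average access time makes this concrete (all cycle counts and the hit rate below are illustrative assumptions, not measurements):

```python
# Average memory access time: caching a 1-cycle internal memory can only add
# cycles, while caching a slow external memory pays off. Numbers are assumed.

def avg_access_cycles(hit_rate, hit_cycles, miss_penalty):
    """Average cycles per access for a cache with the given hit rate."""
    return hit_rate * hit_cycles + (1 - hit_rate) * (hit_cycles + miss_penalty)

internal_direct   = 1.0                              # IMEM/DMEM, no cache
internal_cached   = avg_access_cycles(1.0, 2, 1)     # lookup adds a cycle even at 100% hits
external_uncached = 10.0                             # e.g. SDRAM via the external bus
external_cached   = avg_access_cycles(0.95, 2, 10)   # assumed 95% hit rate

assert internal_cached > internal_direct    # caching fast internal memory hurts
assert external_cached < external_uncached  # caching slow external memory helps
```

Even with a perfect hit rate the cached internal path costs 2 cycles versus 1 uncached, while the cached external path drops from 10 cycles to 2.5 under these assumptions - which is the asymmetry driving the whole question.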
Right now there are just two modules with more than 1 cycle of access latency: the execute-in-place module (XIP) and the external memory interface (WISHBONE). Accesses to these modules can really benefit from having the caches.
Now the question: do we really need the caches? And do we need them where they are right now?
I've been thinking about relocating the caches. The read-only i-cache could be moved inside the XIP module, while the read/write d-cache could be moved to the external bus interface. That way, we would only cache accesses that are guaranteed to have a very large latency compared to the other internal modules.
Of course, the caches are quite handy where they are right now: they reduce bus traffic, as the CPU's instruction and data ports have to share a single processor-wide bus. But actually, this is not a big deal, as the CPU is equipped with an instruction prefetch buffer that speculatively fetches instructions whenever the CPU does not perform a load/store operation. So I think it won't hurt (too much) not to have the caches right in front of the CPU.
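The effect of the prefetch buffer on the shared bus can be sketched with a toy model (the fetch/consume rates and the arbitration below are simplifying assumptions, not the actual NEORV32 bus behavior):

```python
# Toy shared-bus model: instruction prefetch only runs in cycles the data port
# leaves free; the CPU needs a new instruction every other cycle (assumed rate)
# and stalls when the prefetch buffer is empty.

def stall_cycles(total_cycles, data_every_n, buffer_depth):
    buffered = 0
    stalls = 0
    for cycle in range(total_cycles):
        bus_free = (cycle % data_every_n != 0)   # a load/store owns the bus otherwise
        if bus_free and buffered < buffer_depth:
            buffered += 1                        # speculative instruction fetch
        if cycle % 2 == 0:                       # CPU consumes an instruction
            if buffered:
                buffered -= 1
            else:
                stalls += 1                      # empty buffer: CPU has to wait
    return stalls

# A small prefetch buffer absorbs a data access every 4th cycle almost entirely,
# while with no buffer every instruction consumption stalls:
assert stall_cycles(100, 4, 4) < stall_cycles(100, 4, 0)
```

Under these assumptions the buffered CPU stalls only once in 100 cycles (the cold start), which is the intuition behind "it won't hurt too much" to drop the CPU-side caches.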
Anyway, I'm curious what you think. 😉