🧪 Add processor data cache (#560)
stnolting authored Mar 28, 2023
2 parents d610a0b + 2ea0b9a commit adf4d6c
Showing 22 changed files with 943 additions and 183 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -31,6 +31,7 @@ mimpid = 0x01040312 => Version 01.04.03.12 => v1.4.3.12

| Date (*dd.mm.yyyy*) | Version | Comment |
|:-------------------:|:-------:|:--------|
| 25.03.2023 | 1.8.2.8 | :test_tube: add configurable data cache (**dCACHE**); [#560](https://github.com/stnolting/neorv32/pull/560) |
| 24.03.2023 | 1.8.2.7 | :sparkles: add full support of `mcounteren` CSR; cleanup counter and PMP CSRs; i-cache optimization; [#559](https://github.com/stnolting/neorv32/pull/559) |
| 18.03.2023 | 1.8.2.6 | add new generic `JEDEC_ID` (official JEDEC identifier; used for `mvendorid` CSR); further generics cleanups; [#557](https://github.com/stnolting/neorv32/pull/557) |
| 17.03.2023 | 1.8.2.5 | add RISC-V `time[h]` CSRs (part of the `Zicntr` ISA extension); [#556](https://github.com/stnolting/neorv32/pull/556) |
3 changes: 2 additions & 1 deletion README.md
@@ -140,7 +140,8 @@ for _custom RISC-V instructions_ (R3-type, R4-type and R5-type);

* processor-internal data and instruction memories ([DMEM](https://stnolting.github.io/neorv32/#_data_memory_dmem) /
[IMEM](https://stnolting.github.io/neorv32/#_instruction_memory_imem)) &
cache ([iCACHE](https://stnolting.github.io/neorv32/#_processor_internal_instruction_cache_icache))
caches ([iCACHE](https://stnolting.github.io/neorv32/#_processor_internal_instruction_cache_icache) and
[dCACHE](https://stnolting.github.io/neorv32/#_processor_internal_data_cache_dcache))
* pre-installed bootloader ([BOOTLDROM](https://stnolting.github.io/neorv32/#_bootloader_rom_bootrom)) with serial user interface;
allows booting application code via UART or from external SPI flash

9 changes: 6 additions & 3 deletions docs/datasheet/cpu.adoc
@@ -312,10 +312,13 @@ The `I` ISA extension is the base RISC-V integer ISA that is always enabled.
| Data fence | `fence` | 5
|=======================

.`fence` Instruction
[NOTE]
Internally, the `fence` instruction does not perform any operation inside the CPU. It only sets the
CPU-internally, the `fence` instruction does not perform any operation inside the CPU. It only sets the
top's `d_bus_fence_o` signal high for one cycle to inform the memory system a `fence` instruction has been
executed. Any flags within the `fence` instruction word are ignored by the hardware.
executed. Any flags within the `fence` instruction word are ignored by the hardware. However, the `d_bus_fence_o`
signal is connected to the <<_processor_internal_data_cache_dcache>>. Hence, executing the `fence` instruction
will clear/flush the data cache and resynchronize it with main memory.
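A minimal C sketch of how software might use this behavior, assuming a GCC/Clang-style toolchain with inline assembly. The helper names `dcache_sync` and `read_shared_word` are hypothetical and not part of any NEORV32 library API; on non-RISC-V hosts the helper degrades to a plain compiler barrier so the snippet still compiles.

```c
#include <stdint.h>

// Hypothetical helper (illustration only): execute `fence` so that the data
// cache is cleared and resynchronized with main memory.
// Returns 1 if a real `fence` instruction was emitted, 0 on other hosts.
static inline int dcache_sync(void) {
#if defined(__riscv)
  __asm__ volatile ("fence" ::: "memory");
  return 1;
#else
  __asm__ volatile ("" ::: "memory"); // compiler barrier only
  return 0;
#endif
}

// Hypothetical usage: read a word that something other than the CPU
// (for example another bus master) may have changed in main memory.
static inline uint32_t read_shared_word(const volatile uint32_t *addr) {
  (void)dcache_sync(); // drop possibly stale cached data first
  return *addr;
}
```

Because the flags inside the `fence` instruction word are ignored by the hardware, the sketch uses the plain `fence` form without predecessor/successor arguments.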


==== `B` ISA Extension
@@ -407,7 +410,7 @@ RISC-V specs. Also, custom trap codes for <<_mcause>> are implemented.
The `Zifencei` CPU extension allows manual synchronization of the instruction stream.
The `fence.i` instruction resets the CPU's front-end (instruction fetch) and flushes the prefetch buffer.
This allows a clean re-fetch of modified instructions from memory. Also, the top's `i_bus_fencei_o` signal is set
high for one cycle to inform the memory system (like the i-cache) to perform a flush/reload.
high for one cycle to inform the memory system (like the <<_processor_internal_instruction_cache_icache>>) to perform a flush/reload.
Any additional flags within the `fence.i` instruction word are ignored by the hardware.

.Instructions and Timing
4 changes: 3 additions & 1 deletion docs/datasheet/overview.adoc
@@ -89,7 +89,7 @@ include::rationale.adoc[]
* optional standard serial interfaces (UART, TWI, SPI (host and device), 1-Wire)
* optional timers and counters (watchdog, system timer)
* optional general purpose IO and PWM; a native NeoPixel(c)-compatible smart LED interface
* optional embedded memories / caches for data, instructions and bootloader
* optional embedded memories and caches for data, instructions and bootloader
* optional external memory interface for custom connectivity
* optional execute in-place (XIP) module to execute code directly from an external SPI flash
* on-chip debugger compatible with OpenOCD and gdb including hardware trigger module
@@ -196,6 +196,7 @@ neorv32_top.vhd - NEORV32 Processor top entity
├neorv32_busswitch.vhd - Processor bus switch for CPU buses (I&D)
├neorv32_bus_keeper.vhd - Processor-internal bus monitor
├neorv32_cfs.vhd - Custom functions subsystem
├neorv32_dcache.vhd - Processor-internal data cache
├neorv32_debug_dm.vhd - on-chip debugger: debug module
├neorv32_debug_dtm.vhd - on-chip debugger: debug transfer module
├neorv32_dmem.entity.vhd - Processor-internal data memory (entity-only!)
@@ -303,6 +304,7 @@ https://stnolting.github.io/neorv32/ug/#_application_specific_processor_configur
| GPIO | General purpose input/output ports | 102 | 98 | 0 | 0
| GPTMR | General Purpose Timer | 153 | 105 | 0 | 0
| iCACHE | Instruction cache (2x4 blocks, 64 bytes per block) | 417 | 297 | 4096 | 0
| dCACHE | Data cache (8 blocks, 64 bytes per block) | 417 | 297 | 4096 | 0
| IMEM | Processor-internal instruction memory (16kB) | 12 | 2 | 131072 | 0
| MTIME | Machine system timer | 345 | 166 | 0 | 0
| NEOLED | Smart LED Interface (NeoPixel/WS2812) (FIFO_depth=1) | 227 | 184 | 0 | 0
11 changes: 9 additions & 2 deletions docs/datasheet/soc.adoc
@@ -18,7 +18,8 @@ image::neorv32_processor.png[align=center]
**Key Features**

* _optional_ processor-internal data and instruction memories (<<_data_memory_dmem,**DMEM**>>/<<_instruction_memory_imem,**IMEM**>>) + cache (<<_processor_internal_instruction_cache_icache,**iCACHE**>>)
* _optional_ processor-internal data and instruction memories (<<_data_memory_dmem,**DMEM**>>/<<_instruction_memory_imem,**IMEM**>>)
* _optional_ caches (<<_processor_internal_instruction_cache_icache,**iCACHE**>>/<<_processor_internal_data_cache_dcache,**dCACHE**>>)
* _optional_ internal bootloader (<<_bootloader_rom_bootrom,**BOOTROM**>>) with UART console & SPI flash boot option
* _optional_ machine system timer (<<_machine_system_timer_mtime,**MTIME**>>), RISC-V-compatible
* _optional_ two independent universal asynchronous receivers and transmitters (<<_primary_universal_asynchronous_receiver_and_transmitter_uart0,**UART0**>>,
@@ -230,6 +231,10 @@ The generic type "suv(x:y)" defines a `std_ulogic_vector(x downto y)`.
| `ICACHE_NUM_BLOCKS` | natural | 4 | Number of blocks ("pages" or "lines"). Has to be a power of two.
| `ICACHE_BLOCK_SIZE` | natural | 64 | Size in bytes of each block. Has to be a power of two.
| `ICACHE_ASSOCIATIVITY` | natural | 1 | Associativity (number of sets). Allowed configurations: `1` = 1 set, direct mapped; `2` = 2-way set-associative.
4+^| **<<_processor_internal_data_cache_dcache>>**
| `DCACHE_EN` | boolean | false | Implement the data cache.
| `DCACHE_NUM_BLOCKS` | natural | 4 | Number of blocks ("pages" or "lines"). Has to be a power of two.
| `DCACHE_BLOCK_SIZE` | natural | 64 | Size in bytes of each block. Has to be a power of two.
4+^| **<<_processor_external_memory_interface_wishbone>>**
| `MEM_EXT_EN` | boolean | false | Implement the external bus interface.
| `MEM_EXT_TIMEOUT` | natural | 255 | Clock cycles after which a pending external bus access will auto-terminate and raise a bus fault exception.
@@ -492,7 +497,7 @@ image::neorv32_bus.png[1300]
[NOTE]
The internal processor bus might appear to be a bottleneck. To reduce congestion on this bus
(when the instruction fetch and data interfaces access the bus at the same time), the instruction fetch of
the CPU is equipped with a prefetch buffer. Instruction fetches can be further buffered using the i-cache.
the CPU is equipped with a prefetch buffer. Memory accesses can be further buffered using the caches.
Furthermore, data accesses (loads and stores) have higher priority than instruction fetch accesses.

[TIP]
@@ -653,6 +658,8 @@ include::soc_bootrom.adoc[]

include::soc_icache.adoc[]

include::soc_dcache.adoc[]

include::soc_wishbone.adoc[]

include::soc_buskeeper.adoc[]
51 changes: 51 additions & 0 deletions docs/datasheet/soc_dcache.adoc
@@ -0,0 +1,51 @@
<<<
:sectnums:
==== Processor-Internal Data Cache (dCACHE)

[cols="<3,<3,<4"]
[frame="topbot",grid="none"]
|=======================
| Hardware source file(s): | neorv32_dcache.vhd |
| Software driver file(s): | none | _implicitly used_
| Top entity port: | none |
| Configuration generics: | `DCACHE_EN` | implement processor-internal data cache when `true`
| | `DCACHE_NUM_BLOCKS` | number of cache blocks (pages/lines)
| | `DCACHE_BLOCK_SIZE` | size of a cache block in bytes
| CPU interrupts: | none |
|=======================

The processor features an optional data cache to improve performance when using memories with high
access latencies. The cache is directly connected to the CPU's data access interface and provides
fully transparent buffering.

The cache is implemented if the `DCACHE_EN` generic is `true`. The size of the cache memory is defined via the
`DCACHE_BLOCK_SIZE` (the size of a single cache block/page/line in bytes; has to be a power of two and greater than or
equal to 4 bytes) and `DCACHE_NUM_BLOCKS` (the total amount of cache blocks; has to be a power of two and greater than or
equal to 1) generics. The data cache provides only a single set, hence it is direct-mapped.
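The constraints above can be sketched in C as follows. This is an illustrative model only: the function and type names are hypothetical, and the offset/index/tag split shown is the conventional layout of a direct-mapped cache, assumed here rather than taken from `neorv32_dcache.vhd`.

```c
#include <stdbool.h>
#include <stdint.h>

// Conventional address split for a direct-mapped cache (assumed layout).
typedef struct {
  uint32_t offset; // byte offset inside the cache block
  uint32_t index;  // cache block the address maps to
  uint32_t tag;    // remaining upper address bits
} dcache_addr_t;

static bool is_pow2(uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Mirrors the documented rules: both values are powers of two,
// block size >= 4 bytes, number of blocks >= 1.
bool dcache_config_valid(uint32_t num_blocks, uint32_t block_size) {
  return is_pow2(num_blocks) && num_blocks >= 1 &&
         is_pow2(block_size) && block_size >= 4;
}

// Total data capacity in bytes.
uint32_t dcache_total_bytes(uint32_t num_blocks, uint32_t block_size) {
  return num_blocks * block_size;
}

// Split a byte address; assumes dcache_config_valid() holds.
dcache_addr_t dcache_split(uint32_t addr, uint32_t num_blocks, uint32_t block_size) {
  dcache_addr_t r;
  r.offset = addr & (block_size - 1);
  r.index  = (addr / block_size) & (num_blocks - 1);
  r.tag    = addr / block_size / num_blocks;
  return r;
}
```

With a single set, two addresses that share the same index but differ in their tag evict each other, which is the usual trade-off of a direct-mapped organization.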

The data cache provides direct (= uncached) accesses to memory in order to access memory-mapped IO (like the
processor-internal IO/peripheral modules). All accesses that target the address range from `0xF0000000` to `0xFFFFFFFF`
will not be cached at all. This also allows attaching custom IO modules via the processor's external memory interface
when they are mapped to the upper-most 256 MB address page (see section <<_address_space>>).
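As a rough software model of this rule, the following C sketch classifies an address as cached or uncached. The macro and function names are hypothetical; only the address range `0xF0000000` to `0xFFFFFFFF` comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

// Start of the uncached region as stated in the text (name is hypothetical).
#define DCACHE_UNCACHED_BASE 0xF0000000u

// True if an access to `addr` bypasses the data cache.
bool dcache_bypasses(uint32_t addr) {
  return addr >= DCACHE_UNCACHED_BASE;
}

// Size of the uncached region in bytes (0xF0000000 .. 0xFFFFFFFF inclusive).
uint64_t dcache_uncached_bytes(void) {
  return 0x100000000ull - (uint64_t)DCACHE_UNCACHED_BASE;
}
```

The region size evaluates to `0x10000000` bytes, that is 256 MB, which matches the upper-most address page reserved for IO.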

.Caching Internal Memories
[NOTE]
The data cache is intended to accelerate data access to **processor-external** memories
(via the external bus interface or via the XIP module).

.Manual Cache Clear/Reload
[NOTE]
By executing the `fence` instruction (<<_i_isa_extension>>) the cache is cleared and a reload from
main memory is triggered.

.Retrieve Cache Configuration from Software
[TIP]
Software can retrieve the cache configuration/layout from the <<_sysinfo_cache_configuration>> register.

.Bus Access Fault Handling
[NOTE]
The cache always loads a complete cache block (aligned to the block size) every time a
cache miss is detected. Each cached word from this block provides a single status bit that indicates if the
corresponding bus access was successful or caused a bus error. Hence, the whole cache block remains valid even
if certain addresses inside caused a bus error. If the CPU accesses any of the faulty cache words, a
data bus error exception is raised.
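The fault handling described in this note can be modeled in C roughly as shown here. This is a behavioral sketch under stated assumptions, not the VHDL implementation: all names are hypothetical, the block holds 16 words (a 64-byte block), and `demo_bus_read` is a stand-in bus that returns the address as data and fails at one fixed address.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_WORDS 16u // 64-byte block / 4 bytes per word (example value)

typedef struct {
  bool     valid;             // whole block is valid after a fill
  uint32_t data[BLOCK_WORDS]; // cached words
  bool     err[BLOCK_WORDS];  // per-word status bit: bus error during fill
} cache_block_t;

// Bus read callback: returns false if the access caused a bus error.
typedef bool (*bus_read_fn)(uint32_t addr, uint32_t *data);

// Stand-in bus for demonstration: data == address, error at 0x108.
bool demo_bus_read(uint32_t addr, uint32_t *data) {
  if (addr == 0x108u) return false;
  *data = addr;
  return true;
}

// On a miss, always load the complete block and record errors per word.
void block_fill(cache_block_t *blk, uint32_t base_addr, bus_read_fn bus_read) {
  for (uint32_t i = 0; i < BLOCK_WORDS; i++) {
    uint32_t word = 0;
    bool ok = bus_read(base_addr + 4u * i, &word);
    blk->data[i] = ok ? word : 0;
    blk->err[i]  = !ok;
  }
  blk->valid = true; // block stays valid even if some words faulted
}

// Returns false where the CPU would see a data bus error exception.
bool block_read(const cache_block_t *blk, uint32_t word_idx, uint32_t *out) {
  if (!blk->valid || word_idx >= BLOCK_WORDS || blk->err[word_idx]) return false;
  *out = blk->data[word_idx];
  return true;
}
```

In this model, filling a block at base address `0x100` succeeds as a whole, while only a later read of word 2 (address `0x108`) reports the fault.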
17 changes: 8 additions & 9 deletions docs/datasheet/soc_icache.adoc
@@ -17,18 +17,18 @@

The processor features an optional instruction cache to improve performance when using memories with high
access latencies. The cache is directly connected to the CPU's instruction fetch interface and provides
full-transparent buffering of instruction fetch accesses to the entire address space.
full-transparent buffering of instruction fetch accesses to the **entire address space**.

The cache is implemented if the `ICACHE_EN` generic is `true`. The size of the cache memory is defined via
`ICACHE_BLOCK_SIZE` (the size of a single cache block/page/line in bytes; has to be a power of two and greater than or
equal to 4 bytes), `ICACHE_NUM_BLOCKS` (the total amount of cache blocks; has to be a power of two and greater than or
equal to 1) and the actual cache associativity `ICACHE_ASSOCIATIVITY` (number of sets; 1 = direct-mapped, 2 = 2-way set-associative).
If the cache associativity is greater than one the LRU replacement policy (least recently used) is used.

equal to 1) and the actual cache associativity `ICACHE_ASSOCIATIVITY` (number of sets; 1 = direct-mapped, 2 = 2-way
set-associative) generics. If the cache associativity is greater than one the LRU replacement policy (least recently
used) is used.

.Caching Internal Memories
[NOTE]
The instruction cache is intended to accelerate instruction fetches from _processor-external_ memories
The instruction cache is intended to accelerate instruction fetches from **processor-external** memories
(via the external bus interface or via the XIP module).

.Manual Cache Clear/Reload
@@ -40,11 +40,10 @@ main memory is triggered. This also allows to implement self-modifying code.
[TIP]
Software can retrieve the cache configuration/layout from the <<_sysinfo_cache_configuration>> register.


**Bus Access Fault Handling**

.Bus Access Fault Handling
[NOTE]
The cache always loads a complete cache block (aligned to the block size) every time a
cache miss is detected. Each cached word from this block provides a single status bit that indicates if the
according bus access was successful or caused a bus error. Hence, the whole cache block remains valid even
if certain addresses inside caused a bus error. If the CPU accesses any of the faulty cache words, an
instruction access error exception is raised.
instruction bus error exception is raised.
18 changes: 11 additions & 7 deletions docs/datasheet/soc_sysinfo.adoc
@@ -58,7 +58,8 @@ will signal a "DEVICE ERROR" in this case.
| `3` | `SYSINFO_SOC_MEM_INT_DMEM` | set if the processor-internal DMEM is implemented (via top's `MEM_INT_DMEM_EN` generic)
| `4` | `SYSINFO_SOC_MEM_EXT_ENDIAN` | set if external bus interface uses BIG-endian byte-order (via top's `MEM_EXT_BIG_ENDIAN` generic)
| `5` | `SYSINFO_SOC_ICACHE` | set if processor-internal instruction cache is implemented (via top's `ICACHE_EN` generic)
| `12:6` | - | _reserved_, read as zero
| `6` | `SYSINFO_SOC_DCACHE` | set if processor-internal data cache is implemented (via top's `DCACHE_EN` generic)
| `12:7` | - | _reserved_, read as zero
| `13` | `SYSINFO_SOC_IS_SIM` | set if processor is being **simulated** (⚠️ not guaranteed)
| `14` | `SYSINFO_SOC_OCD` | set if on-chip debugger implemented (via top's `ON_CHIP_DEBUGGER_EN` generic)
| `15` | - | _reserved_, read as zero
@@ -90,10 +91,13 @@ Bit fields in this register are set to all-zero if the according cache is not im
[cols="^1,<10,<11"]
[options="header",grid="all"]
|=======================
| Bit | Name [C] | Function
| `3:0` | `SYSINFO_CACHE_IC_BLOCK_SIZE_3 : SYSINFO_CACHE_IC_BLOCK_SIZE_0` | _log2_(i-cache block size in bytes), via top's `ICACHE_BLOCK_SIZE` generic
| `7:4` | `SYSINFO_CACHE_IC_NUM_BLOCKS_3 : SYSINFO_CACHE_IC_NUM_BLOCKS_0` | _log2_(i-cache number of cache blocks), via top's `ICACHE_NUM_BLOCKS` generic
| `11:8` | `SYSINFO_CACHE_IC_ASSOCIATIVITY_3 : SYSINFO_CACHE_IC_ASSOCIATIVITY_0` | _log2_(i-cache associativity), via top's `ICACHE_ASSOCIATIVITY` generic
| `15:12` | `SYSINFO_CACHE_IC_REPLACEMENT_3 : SYSINFO_CACHE_IC_REPLACEMENT_0` | i-cache replacement policy (`0001` = LRU if associativity > 0)
| `31:16` | - | zero, reserved for d-cache
| Bit | Name [C] | Function
| `3:0` | `SYSINFO_CACHE_IC_BLOCK_SIZE_3 : SYSINFO_CACHE_IC_BLOCK_SIZE_0` | _log2_(i-cache block size in bytes), via top's `ICACHE_BLOCK_SIZE` generic
| `7:4` | `SYSINFO_CACHE_IC_NUM_BLOCKS_3 : SYSINFO_CACHE_IC_NUM_BLOCKS_0` | _log2_(i-cache number of cache blocks), via top's `ICACHE_NUM_BLOCKS` generic
| `11:8`  | `SYSINFO_CACHE_IC_ASSOCIATIVITY_3 : SYSINFO_CACHE_IC_ASSOCIATIVITY_0` | _log2_(i-cache associativity), via top's `ICACHE_ASSOCIATIVITY` generic
| `15:12` | `SYSINFO_CACHE_IC_REPLACEMENT_3 : SYSINFO_CACHE_IC_REPLACEMENT_0` | i-cache replacement policy (`0001` = LRU if associativity > 0)
| `19:16` | `SYSINFO_CACHE_DC_BLOCK_SIZE_3 : SYSINFO_CACHE_DC_BLOCK_SIZE_0` | _log2_(d-cache block size in bytes), via top's `DCACHE_BLOCK_SIZE` generic
| `23:20` | `SYSINFO_CACHE_DC_NUM_BLOCKS_3 : SYSINFO_CACHE_DC_NUM_BLOCKS_0` | _log2_(d-cache number of cache blocks), via top's `DCACHE_NUM_BLOCKS` generic
| `27:24` | `SYSINFO_CACHE_DC_ASSOCIATIVITY_3 : SYSINFO_CACHE_DC_ASSOCIATIVITY_0` | always zero
| `31:28` | `SYSINFO_CACHE_DC_REPLACEMENT_3 : SYSINFO_CACHE_DC_REPLACEMENT_0` | always zero
|=======================
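A short C sketch of how software might decode the d-cache fields from these registers. The bit positions are taken from the tables above (SoC feature bit 6, block size in bits `19:16`, number of blocks in bits `23:20`); the type, enum and function names are hypothetical, and the raw register values are passed in as plain integers rather than read through any specific driver API.

```c
#include <stdbool.h>
#include <stdint.h>

// Bit positions as listed in the tables above (local names are hypothetical).
enum {
  SOC_DCACHE_BIT    = 6,  // SYSINFO_SOC_DCACHE
  DC_BLOCK_SIZE_LSB = 16, // SYSINFO_CACHE_DC_BLOCK_SIZE_0
  DC_NUM_BLOCKS_LSB = 20  // SYSINFO_CACHE_DC_NUM_BLOCKS_0
};

typedef struct {
  bool     implemented; // data cache present at all
  uint32_t block_size;  // bytes per block
  uint32_t num_blocks;  // number of blocks
  uint32_t total_bytes; // block_size * num_blocks
} dcache_layout_t;

// Decode the d-cache layout from raw SoC-feature and cache-configuration words.
dcache_layout_t decode_dcache_layout(uint32_t soc_reg, uint32_t cache_reg) {
  dcache_layout_t l = { false, 0, 0, 0 };
  if (((soc_reg >> SOC_DCACHE_BIT) & 1u) == 0) {
    return l; // fields read as all-zero when the cache is not implemented
  }
  l.implemented = true;
  // Both fields hold log2 of the actual value.
  l.block_size  = 1u << ((cache_reg >> DC_BLOCK_SIZE_LSB) & 0xFu);
  l.num_blocks  = 1u << ((cache_reg >> DC_NUM_BLOCKS_LSB) & 0xFu);
  l.total_bytes = l.block_size * l.num_blocks;
  return l;
}
```

Checking the feature bit first matters because an all-zero field would otherwise decode to a size of 1, which is indistinguishable from a genuine _log2_ value of zero.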
Binary file modified docs/figures/neorv32_bus.png
Binary file modified docs/figures/neorv32_processor.png