Publish early icebreaker demo.

cbalint13 · Jul 1, 2021 · 54c4594 · 54c4594
1 parent b72f48b
commit 54c4594

Showing 26 changed files with 6,768 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -1,2 +1,54 @@
+
 # OLIMP
-Open Library for Intelligent Machine Processors
+
+**O**pen **L**ibrary for **I**nteger **M**achine **P**rocessing
+
+![OLIMP](https://github.com/cbalint13/OLIMP/blob/main/docs/logo/olimp-logo.png)
+
+OLIMP is a collection of **configurable hardware elements** that operate on **vectors** and ultimately **tensors**.
+
+```
+Q: What can OLIMP do?
+A: Hardware for numerical operators.
+
+Q: What does OLIMP mean for a design?
+A: Designing numerical elements in the vast combinatorial space.
+
+Q: What is the key for any OLIMP operator?
+A: Scheduling is **the key**, both in hardware and software.
+
+Q: What can OLIMP be used for?
+A: Build hardware that computes on any budget, from a tiny FPGA to a large ASIC.
+```
+
+OLIMP uses [TVM](https://github.com/apache/tvm) for compute **scheduling**, closing the gap between **hardware** and **software** in end-to-end design.
+
+-------------------------------------------------------------------------------------
+
+OLIMP targets operators of various precisions:
+
+ * **integers** of 2, 4, 8, 16, or 32 bits, with fixed or mixed precision
+ * bitwise ordered **bitplanes** on atomic boolean logic: AND, XOR, POPCNT
+
+--------------------------------------------------------------------------------------
+
+### Demo SoC
+
+  Check out the [Demo SoC](/demo/):
+
+   - OLIMP vector element as a RISC-V extension on the tiny [icebreaker](https://github.com/icebreaker-fpga/icebreaker).
+   - OLIMP elements for the dedicated [e-verest](https://github.com/cbalint13/e-verest) USB stick, on a budget.
+
+--------------------------------------------------------------------------------------
+
+
+**ChangeLog**:
+   * *01-Jun-2021* early release
+
+**ToDo (WiP)**:
+   * finish icebreaker demo
+   * publish RTL generators with documentation
+   * publish TVM TOPI schedules for each RTL module
+   * validate RTL modules via TVM [verilate](https://github.com/apache/tvm/tree/main/src/relay/backend/contrib/verilator) 
+   * constrained end-to-end architecture search & optimisation
+   * showcase advanced OLIMP blocks on e-verest (ECP5-85k)
diff --git a/demo/everest/README.md b/demo/everest/README.md
@@ -0,0 +1,5 @@
+
+**WiP**
+
+  OLIMP demos for [e-verest](https://github.com/cbalint13/e-verest) stick.
+
diff --git a/demo/icebreaker/Makefile b/demo/icebreaker/Makefile
@@ -0,0 +1,74 @@
+
+CROSS=riscv64-unknown-elf-
+CFLAGS=-g -Os -march=rv32im -mabi=ilp32
+
+
+all: out/icebreaker.json out/icebreaker.bin
+
+sim: icesim
+
+flash: iceprog
+
+##
+## Synthesis
+##
+
+out/icebreaker.json: rtl/icebreaker.v rtl/spimemio.v rtl/simpleuart.v rtl/icesoc.v rtl/picorv32.v rtl/olimp.v rtl/memory.v
+	mkdir -p log out
+	yosys -ql log/icebreaker-syn.log -p 'synth_ice40 -abc9 -device u -top icebreaker -json out/icebreaker.json' $^
+	cat log/icebreaker-syn.log | sed -n '/statistics/,/CHECK/p'
+
+out/icebreaker.asc: out/icebreaker.json
+	nextpnr-ice40 --freq 40 -l log/icebreaker-pnr.log --pre-pack brd/clocks.py --up5k  --package sg48 --asc out/icebreaker.asc --pcf brd/icebreaker.pcf --json out/icebreaker.json
+	cat log/icebreaker-pnr.log | sed -n '/Device utilisation/,/Placed/p'
+
+out/icebreaker.bin: out/icebreaker.asc
+	icetime -d up5k -c 20 -mtr out/icebreaker.rpt out/icebreaker.asc
+	icepack out/icebreaker.asc out/icebreaker.bin
+
+##
+## Simulation
+##
+
+out/icebreaker_tb.vvp: sim/icebreaker_tb.v rtl/icebreaker.v rtl/spimemio.v rtl/simpleuart.v rtl/icesoc.v rtl/picorv32.v rtl/olimp.v rtl/memory.v sim/spiflash.v sim/clkdiv.v
+	mkdir -p log out
+	iverilog -s testbench -o $@ $^ `yosys-config --datdir/ice40/cells_sim.v`
+
+icesim: out/icebreaker_tb.vvp out/icebreaker_fw.hex
+	vvp -N $< +firmware=out/icebreaker_fw.hex
+
+##
+## Flash
+##
+
+iceprog: out/icebreaker.bin out/icebreaker_fw.bin
+	iceprog out/icebreaker.bin
+	iceprog -o 1M out/icebreaker_fw.bin
+
+iceprog_fw: out/icebreaker_fw.bin
+	iceprog -o 1M out/icebreaker_fw.bin
+
+##
+## Firmware
+##
+
+brd/icebreaker_sections.lds: src/sections.lds
+	$(CROSS)cpp -P -DICEBREAKER -o $@ $^
+
+out/icebreaker_fw.elf: brd/icebreaker_sections.lds src/start.s src/firmware.c
+	$(CROSS)gcc $(CFLAGS) -DICEBREAKER -Wl,-Bstatic,-T,brd/icebreaker_sections.lds,--strip-debug -ffreestanding -nostdlib -o out/icebreaker_fw.elf src/start.s src/firmware.c
+	$(CROSS)size --format berkeley out/icebreaker_fw.elf
+
+out/icebreaker_fw.hex: out/icebreaker_fw.elf
+	$(CROSS)objcopy -O verilog out/icebreaker_fw.elf out/icebreaker_fw.hex
+
+out/icebreaker_fw.bin: out/icebreaker_fw.elf
+	$(CROSS)objcopy -O binary out/icebreaker_fw.elf out/icebreaker_fw.bin
+
+# ---- Clean ----
+
+clean:
+	rm -rf out log
+	rm -f testbench.vcd
+
+.PHONY: all sim flash iceprog iceprog_fw icesim clean
diff --git a/demo/icebreaker/README.md b/demo/icebreaker/README.md
@@ -0,0 +1,225 @@
+# RISC-V System On Chip with OLIMP extension
+
+The demo **RISC-V SoC** on [IceBreaker](https://github.com/icebreaker-fpga/icebreaker) implements:
+
+   * **CPU** @ **20 MHz** **rv32im** using [PicoRV32](https://github.com/cliffordwolf/picorv32) with ISA extensions
+   * **4MByte ROM** as fast Quad (x4) DDR (+QPI) continuous memory mapped @ **40 MHz** I/O clock
+   * **128kByte RAM** main memory organized as **64bit** wide (4 x 32kByte SRAM)
+   * **128bit wide** BRAM memory for coefficients (N x 8 x 128Byte BRAM)
+   * RISC-V **ISA extended** with OLIMP **VEC-8U8-16I8-2S32** running @ **40 MHz** DSP clock
+
+*Note*: OLIMP VEC-8U8-16I8-2S32 is the largest possible vector block that fits the ICE40 UP5K.
+
+-------------------------------------------------------------------------------------
+
+## The OLIMP **VEC-8U8-16I8-2S32** block:
+
+```
+IN Vectors:
+    8 x uint8 ( 64 bit)
+   16 x  int8 (128 bit)
+   ---------------------
+OUT Lanes:
+   2 x int32  (2 x 32 bit)
+```
+![8U8-16I8-2S32](/docs/imgs/OLIMP-VEC-8U8-16I8-2S32.png)
+
+
+The OLIMP block extends the **rv32im** ISA via the [PCPI](https://github.com/cliffordwolf/picorv32#pico-co-processor-interface-pcpi) interface.
+
+The **VEC-8U8-16I8-2S32** block itself executes in **2 CPU clock** cycles, but the PicoRV32 PCPI transaction completes in **6 CPU clock** cycles.
+
+-------------------------------------------------------------------------------------
+
+## TVM magic tensorization using OLIMP's MACC
+
+Example of generated TIR representation for a MATMUL [64x64]*[64x64] inside TVM:
+
+```
+primfn(X_1: handle, coeffW_1: handle, F.global_1: handle) -> ()
+  attr = {"global_symbol": "main", "tir.noalias": True}
+  buffers = {F.global: Buffer(F.global_2: Pointer(int32), int32, [64, 64], []),
+             coeffW: Buffer(coeffW_2: Pointer(int8), int8, [2, 32, 8, 8], []),
+             X: Buffer(X_2: Pointer(uint8), uint8, [64, 64], [])}
+  buffer_map = {X_1: X, coeffW_1: coeffW, F.global_1: F.global} {
+  attr [F: Pointer(int32)] "storage_scope" = "global";
+  allocate(F, int32, [4096]);
+  attr [coeffW.global: Pointer(int8)] "storage_scope" = "global";
+  allocate(coeffW.global, int8, [128]);
+  for (i: int32, 0, 64) {
+    for (j.outer: int32, 0, 32) {
+      for (ax0: int32, 0, 2) {
+        for (ax2: int32, 0, 8) {
+          for (ax3: int32, 0, 8) {
+            coeffW.global[(((ax0*64) + (ax2*8)) + ax3)] = (int8*)coeffW_2[((((ax0*2048) + (j.outer*64)) + (ax2*8)) + ax3)]
+          }
+        }
+      }
+      @tir.call_extern("MACZ_olimp", 
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), F.global_2, ((i*64) + (j.outer*2)), 2, 2, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=uint8), X_2, (i*64), 8, 1, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), coeffW.global, 0, 128, 1, dtype=handle), 64, dtype=int32)
+      @tir.call_extern("MACC_olimp",
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), F.global_2, ((i*64) + (j.outer*2)), 2, 2, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=uint8), X_2, ((i*64) + 8), 8, 1, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), coeffW.global, 8, 128, 1, dtype=handle), 64, dtype=int32)
+      @tir.call_extern("MACC_olimp",
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), F.global_2, ((i*64) + (j.outer*2)), 2, 2, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=uint8), X_2, ((i*64) + 16), 8, 1, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), coeffW.global, 16, 128, 1, dtype=handle), 64, dtype=int32)
+      @tir.call_extern("MACC_olimp",
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), F.global_2, ((i*64) + (j.outer*2)), 2, 2, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=uint8), X_2, ((i*64) + 24), 8, 1, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), coeffW.global, 24, 128, 1, dtype=handle), 64, dtype=int32)
+      @tir.call_extern("MACC_olimp",
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), F.global_2, ((i*64) + (j.outer*2)), 2, 2, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=uint8), X_2, ((i*64) + 32), 8, 1, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), coeffW.global, 32, 128, 1, dtype=handle), 64, dtype=int32)
+      @tir.call_extern("MACC_olimp", 
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), F.global_2, ((i*64) + (j.outer*2)), 2, 2, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=uint8), X_2, ((i*64) + 40), 8, 1, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), coeffW.global, 40, 128, 1, dtype=handle), 64, dtype=int32)
+      @tir.call_extern("MACC_olimp",
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), F.global_2, ((i*64) + (j.outer*2)), 2, 2, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=uint8), X_2, ((i*64) + 48), 8, 1, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), coeffW.global, 48, 128, 1, dtype=handle), 64, dtype=int32)
+      @tir.call_extern("MACC_olimp", 
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int32), F.global_2, ((i*64) + (j.outer*2)), 2, 2, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=uint8), X_2, ((i*64) + 56), 8, 1, dtype=handle),
+        @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), coeffW.global, 56, 128, 1, dtype=handle), 64, dtype=int32)
+      for (j.inner: int32, 0, 2) {
+        F[(((i*64) + (j.outer*2)) + j.inner)] = (int32*)F.global_2[(((i*64) + (j.outer*2)) + j.inner)]
+      }
+    }
+  }
+}
+```
+
+ [TVM](https://github.com/apache/tvm) provides **complete** end-to-end code generation to the **C language** using OLIMP hardware scheduling:
+
+   * All **dense** operations benefit from the OLIMP hardware block via **@tir.call_extern("MACC_olimp")**.
+   * Further, **conv2d** schedules translate to many **dense** schedules.
+   * Any **other** operators not covered by OLIMP hardware fall back to software on the RV32IM core (slower).
+   * TVM schedulers also **guarantee** continuous & aligned access to the vector segments in memory.
+   * TVM schedulers **handle memory** prefetching (DMA) or constrained access from slow memory regions.
+
+### Example of generated C code
+
+```
+// tvm target: c -keys=cpu -link-params=0
+#define TVM_EXPORTS
+#include "tvm/runtime/c_runtime_api.h"
+#include "tvm/runtime/c_backend_api.h"
+#include <math.h>
+#ifdef __cplusplus
+extern "C"
+#endif
+TVM_DLL int32_t intrinsic(void* args, void* arg_type_ids, int32_t num_args, void* out_ret_value, void* out_ret_tcode, void* resource_handle) {
+  void* arg0 = (((TVMValue*)args)[0].v_handle);
+  int32_t arg0_code = ((int32_t*)arg_type_ids)[(0)];
+  void* arg1 = (((TVMValue*)args)[1].v_handle);
+  int32_t arg1_code = ((int32_t*)arg_type_ids)[(1)];
+  void* arg2 = (((TVMValue*)args)[2].v_handle);
+  int32_t arg2_code = ((int32_t*)arg_type_ids)[(2)];
+  void* X = (((DLTensor*)arg0)[0].data);
+  void* arg0_shape = (((DLTensor*)arg0)[0].shape);
+  void* arg0_strides = (((DLTensor*)arg0)[0].strides);
+  int32_t dev_id = (((DLTensor*)arg0)[0].device.device_id);
+  void* coeffW = (((DLTensor*)arg1)[0].data);
+  void* arg1_shape = (((DLTensor*)arg1)[0].shape);
+  void* arg1_strides = (((DLTensor*)arg1)[0].strides);
+  void* F_global = (((DLTensor*)arg2)[0].data);
+  void* arg2_shape = (((DLTensor*)arg2)[0].shape);
+  void* arg2_strides = (((DLTensor*)arg2)[0].strides);
+  if (!(arg0_strides == NULL)) {
+  }
+  if (!(arg1_strides == NULL)) {
+  }
+  if (!(arg2_strides == NULL)) {
+  }
+  void* F = TVMBackendAllocWorkspace(1, dev_id, (uint64_t)16384, 0, 32);
+  if (F == NULL) {
+    return -1;
+  }
+  void* coeffW_global = TVMBackendAllocWorkspace(1, dev_id, (uint64_t)128, 0, 8);
+  if (coeffW_global == NULL) {
+    return -1;
+  }
+  for (int32_t i = 0; i < 64; ++i) {
+    for (int32_t j_outer = 0; j_outer < 32; ++j_outer) {
+      for (int32_t ax0 = 0; ax0 < 2; ++ax0) {
+        for (int32_t ax2 = 0; ax2 < 8; ++ax2) {
+          for (int32_t ax3 = 0; ax3 < 8; ++ax3) {
+            ((int8_t*)coeffW_global)[((((ax0 * 64) + (ax2 * 8)) + ax3))] = 
+                ((int8_t*)coeffW)[(((((ax0 * 2048) + (j_outer * 64)) + (ax2 * 8)) + ax3))];
+          }
+        }
+      }
+      (void)MACZ_olimp(((int32_t *)F_global + (((i * 64) + (j_outer * 2)))), 
+                       ((uint8_t *)X + ((i * 64))), 
+                       ((int8_t *)coeffW_global + (0)), 64);
+      (void)MACC_olimp(((int32_t *)F_global + (((i * 64) + (j_outer * 2)))), 
+                       ((uint8_t *)X + (((i * 64) + 8))), 
+                       ((int8_t *)coeffW_global + (8)), 64);
+      (void)MACC_olimp(((int32_t *)F_global + (((i * 64) + (j_outer * 2)))), 
+                       ((uint8_t *)X + (((i * 64) + 16))), ((int8_t *)coeffW_global + (16)), 64);
+      (void)MACC_olimp(((int32_t *)F_global + (((i * 64) + (j_outer * 2)))), 
+                       ((uint8_t *)X + (((i * 64) + 24))), ((int8_t *)coeffW_global + (24)), 64);
+      (void)MACC_olimp(((int32_t *)F_global + (((i * 64) + (j_outer * 2)))), 
+                       ((uint8_t *)X + (((i * 64) + 32))), ((int8_t *)coeffW_global + (32)), 64);
+      (void)MACC_olimp(((int32_t *)F_global + (((i * 64) + (j_outer * 2)))), 
+                       ((uint8_t *)X + (((i * 64) + 40))), ((int8_t *)coeffW_global + (40)), 64);
+      (void)MACC_olimp(((int32_t *)F_global + (((i * 64) + (j_outer * 2)))),
+                       ((uint8_t *)X + (((i * 64) + 48))), ((int8_t *)coeffW_global + (48)), 64);
+      (void)MACC_olimp(((int32_t *)F_global + (((i * 64) + (j_outer * 2)))), 
+                       ((uint8_t *)X + (((i * 64) + 56))), ((int8_t *)coeffW_global + (56)), 64);
+      for (int32_t j_inner = 0; j_inner < 2; ++j_inner) {
+        ((int32_t*)F)[((((i * 64) + (j_outer * 2)) + j_inner))] =
+            ((int32_t*)F_global)[((((i * 64) + (j_outer * 2)) + j_inner))];
+      }
+    }
+  }
+  if (TVMBackendFreeWorkspace(1, dev_id, coeffW_global) != 0) {
+    return -1;
+  }
+  if (TVMBackendFreeWorkspace(1, dev_id, F) != 0) {
+    return -1;
+  }
+  return 0;
+}
+
+```
+
+*Note*: [MACC_olimp()](/demo/icebreaker/src/firmware.c#L340) wraps an **__asm__( ".word 0xRV32custom")** RV32 ISA extension instruction for the OLIMP hardware block.
+
+-------------------------------------------------------------------------------------
+
+## Synthesis on ICE40 UP5K (IceBreaker)
+
+ICE40 UP5K summary (01-Jul-2021):
+```
+    ICESTORM_LC:  4394/ 5280    83%
+    ICESTORM_RAM:   12/   30    40%
+    ICESTORM_DSP:    8/    8   100%
+    ICESTORM_SPRAM:  4/    4   100%
+```
+
+Clocking:
+```
+Info: Max frequency for clock 'clk_spi': 26.12 MHz (PASS at 25.00 MHz)
+Info: Max frequency for clock 'clk_cpu': 24.72 MHz (PASS at 20.00 MHz)
+```
+*Note*: clk_spi (which also drives ICE40_DSP) in fact closes timing above 40 MHz.
+
+-------------------------------------------------------------------------------------
+
+
+## ChangeLog
+   * *01-Jun-2021* early demo experiments
+
+## ToDo (WiP)
+   * finish access to final accumulated lanes in RTL
+   * add RTL for memory-mapping a small camera & microphone
+   * publish TVM code parts to support: dense, conv2d
+   * TVM tutorial on importing neural networks end-to-end from TensorFlow, PyTorch, ONNX
+
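Reading the TIR and the generated C above, the `MACZ_olimp` / `MACC_olimp` calls appear to compute two 8-element dot products per call, one per int32 lane, with the last argument (64) acting as the stride between the two coefficient rows. The commit does not spell out these semantics, so the Python reference model below is only a sketch under that assumption. `pack_weights` and `check` are hypothetical helpers, and the `w[k][j]` orientation of the weight matrix is likewise assumed. The model follows the generated loop nest and compares the result with a naive matrix product.

```python
import random

N = 64      # matrix dimension from the MATMUL example
LANES = 2   # 2 x int32 output lanes
VEC = 8     # 8 x uint8 inputs consumed per instruction


def macc_olimp(acc, acc_off, x, x_off, w, w_off, stride, zero=False):
    """Assumed semantics: each lane accumulates one 8-element dot product.

    zero=True models MACZ_olimp (clear the lanes, then accumulate).
    """
    for lane in range(LANES):
        total = 0 if zero else acc[acc_off + lane]
        for k in range(VEC):
            total += x[x_off + k] * w[w_off + lane * stride + k]
        acc[acc_off + lane] = total  # int32 wrap-around is not modelled


def pack_weights(w):
    """Pack w[k][j] into the flat [2, 32, 8, 8] layout (hypothetical helper)."""
    packed = [0] * (N * N)
    for lane in range(LANES):
        for j_outer in range(N // LANES):
            for k in range(N):
                packed[lane * 2048 + j_outer * 64 + k] = w[k][j_outer * LANES + lane]
    return packed


def matmul_olimp(x, packed):
    """Follow the generated loop nest: stage a 2 x 64 tile, then 1 MACZ + 7 MACC."""
    out = [0] * (N * N)
    for i in range(N):
        for j_outer in range(N // LANES):
            # Copy one coefficient tile, as the ax0/ax2/ax3 loops do.
            tile = [0] * (LANES * N)
            for ax0 in range(LANES):
                for k in range(N):
                    tile[ax0 * 64 + k] = packed[ax0 * 2048 + j_outer * 64 + k]
            for step in range(N // VEC):
                macc_olimp(out, i * 64 + j_outer * 2, x, i * 64 + step * 8,
                           tile, step * 8, 64, zero=(step == 0))
    return out


def check(seed=0):
    """Compare the model against a naive matmul on random uint8 / int8 data."""
    rng = random.Random(seed)
    x = [rng.randrange(0, 256) for _ in range(N * N)]  # uint8, row-major
    w = [[rng.randrange(-128, 128) for _ in range(N)] for _ in range(N)]  # int8
    ref = [sum(x[i * N + k] * w[k][j] for k in range(N))
           for i in range(N) for j in range(N)]
    return matmul_olimp(x, pack_weights(w)) == ref


if __name__ == "__main__":
    print("model matches naive matmul:", check())
```

A model like this can serve as a golden reference when comparing firmware output over the UART, provided the assumed lane and stride semantics are first confirmed against `rtl/olimp.v`.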
diff --git a/demo/icebreaker/brd/clocks.py b/demo/icebreaker/brd/clocks.py
@@ -0,0 +1,2 @@
+ctx.addClock("clk_spi", 25)
+ctx.addClock("clk_cpu", 20)
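The 20 MHz `clk_cpu` constraint above, combined with the cycle counts quoted in the icebreaker README (2 cycles for the block, 6 cycles per PCPI transaction), allows a rough peak-throughput estimate. The sketch below assumes 16 multiply-accumulates per instruction (8 inputs x 2 lanes) and ignores loads, loop overhead, and the coefficient tile copy, so the time it prints is a lower bound and not a measurement. All function and constant names are illustrative.

```python
CLK_CPU_MHZ = 20       # clk_cpu constraint from brd/clocks.py
PCPI_CYCLES = 6        # CPU cycles per OLIMP instruction through PCPI (README)
BLOCK_CYCLES = 2       # CPU cycles for the vector block alone (README)
MACS_PER_INSN = 8 * 2  # assumption: 8 inputs x 2 lanes per instruction


def peak_mmacs(clk_mhz, cycles_per_insn, macs_per_insn=MACS_PER_INSN):
    """Peak throughput in millions of MACs per second."""
    return clk_mhz * macs_per_insn / cycles_per_insn


def matmul_insn_count(n=64, vec=8, lanes=2):
    """Number of MACZ/MACC calls the generated loop nest issues for n x n."""
    return n * (n // lanes) * (n // vec)


def matmul_lower_bound_ms(n=64, clk_mhz=CLK_CPU_MHZ, cycles=PCPI_CYCLES):
    """Time spent in OLIMP instructions only, in milliseconds."""
    return matmul_insn_count(n) * cycles / (clk_mhz * 1000.0)


if __name__ == "__main__":
    print("peak via PCPI  [MMAC/s]:", peak_mmacs(CLK_CPU_MHZ, PCPI_CYCLES))
    print("peak block only[MMAC/s]:", peak_mmacs(CLK_CPU_MHZ, BLOCK_CYCLES))
    print("64x64 matmul instructions:", matmul_insn_count())
    print("64x64 matmul lower bound [ms]:", matmul_lower_bound_ms())
```

The gap between the two peak figures suggests how much the PCPI handshake costs relative to the block itself.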
diff --git a/demo/icebreaker/brd/icebreaker.pcf b/demo/icebreaker/brd/icebreaker.pcf
@@ -0,0 +1,14 @@
+# 12 MHz clock
+set_io osc12      35
+
+# RS232
+set_io ser_rx      6
+set_io ser_tx      9
+
+# SPI Flash
+set_io flash_clk  15
+set_io flash_csb  16
+set_io flash_io0  14
+set_io flash_io1  17
+set_io flash_io2  12
+set_io flash_io3  13