# From Blinker to RISC-V

This tutorial is a progressive journey from a simple blinky design to a RISC-V core.

It works with the following boards:
- IceStick
- IceBreaker
- ULX3S
- ARTY

If you do not have a board, you can run everything in simulation (but it is not as fun).

## About this tutorial

- it is a progressive introduction, changing only one thing at a time. It is a curated version of my logbook when I learnt these notions (2020-2022). I also tried to keep track of all the dead ends I explored and the traps that caught me; they are often indicated as side remarks and notes;
- I try to keep hardware requirements as minimal as possible. With the tiniest FPGA (IceStick Ice40HX1K) you can do the first episode of the tutorial and transform the blinky into a fully functional RV32I microcontroller that can execute compiled C code.
- in the end, the obtained processor is not the most efficient, but it is not a toy: it can execute any program. To answer the question you may ask, yes, it [runs DOOM](https://github.com/BrunoLevy/learn-fpga/tree/master/LiteX/software/Doom)! (but not on an IceStick, you will need a larger FPGA). It works with the help of LiteX, which has a nice SDRAM controller, because Doom needs some RAM;
- the tutorial is both about hardware and software: you will learn how to compile programs in assembly and in C for your core;
- I try to make all example programs fun and interesting while reasonably short. The bundled demo programs include:
  - mandelbrot set in assembly and in C
  - rotozoom graphic effect
  - drawing filled polygons
  - raytracing

  These graphics programs are all displayed in text mode on the terminal, using ANSI escape sequences (yes, this makes BIG pixels). For more fun, it is also possible to use a small OLED display instead (will add instructions for that in the future).
- [Episode II](PIPELINE.md) is on pipelining; you will learn there how to transform the basic processor obtained at the end of this tutorial into a more efficient pipelined processor with branch prediction.
- [Episode III](INTERRUPTS.md) is a WIP on interrupts and the privileged RISC-V ISA.
- This tutorial is in VERILOG. It is currently being ported into other HDLs
  - [Amaranth/nMigen version](https://github.com/bl0x/learn-fpga-amaranth) by @bl0x
  - TODO: Silice version
  - TODO: SpinalHDL version

## Introduction and references on processor design

To understand processor design, the first thing that I read was [this answer](https://stackoverflow.com/questions/51592244/implementation-of-simple-microprocessor-using-verilog/51621153#51621153) on Stackoverflow, which I found inspiring. There is also [this article](http://www.fpgacpu.org/papers/xsoc-series-drafts.pdf) suggested by @mithro. For a complete course, I highly recommend [this one from MIT](http://web.mit.edu/6.111/www/f2016/); it also gives the principles for going much further than what I've done here (pipelines etc...).

For Verilog basics and syntax, I read _Verilog by Example_ by Blaine C. Readler; it is also short and to the point.

There are two nice things with the Stackoverflow answer:
- it goes to the essential, and keeps nothing other than what's essential
- the example it uses is a RISC processor, which shares several similarities with RISC-V (except that it has status flags, which RISC-V does not have).

What we learn there is that there will be a _register file_, that stores the so-called _general-purpose_ registers. By general-purpose, we mean that each time an instruction reads a register, it can be any of them, and each time an instruction writes a register, it can be any of them, unlike the x86 (CISC) that has _specialized_ registers.
To implement the most general instruction (`register <- register OP register`), the register file will read two registers at each cycle, and optionally write back one.

There will be an _ALU_, that will compute an operation on two values.

There will also be a _decoder_, that will generate all required internal signals from the bit pattern of the current instruction.

If you want to design a RISC-V processor on your own, I recommend you take a deep look at [the Stackoverflow answer](https://stackoverflow.com/questions/51592244/implementation-of-simple-microprocessor-using-verilog/51621153#51621153), and do some schematics on your own to have all the general ideas in mind before going further... or you can choose to directly jump into this tutorial, one step at a time. It will gently take you from the most trivial Blinky design to a fully functional RISC-V core.

## Prerequisites:

The first step is cloning the learn-fpga repository:
```
$ git clone https://github.com/BrunoLevy/learn-fpga.git
```

Before starting, you will need to install the following software:
- iverilog/icarus (simulation)
```
$ sudo apt-get install iverilog
```
- yosys/nextpnr, the toolchain for your board. See [this link](../toolchain.md).

Note that iverilog/icarus is sufficient to run and play with all the steps of the tutorial, but the experience is not the same. I highly recommend running each step on a real device. The excitement of seeing your own processor run some code for the first time is not of the same magnitude when you are doing simulation !!!

## Step 1: your first blinky

Let us start and create our first blinky !
Our blinky is implemented as a VERILOG module, connected to inputs and outputs, as follows ([step1.v](step1.v)):
```verilog
   module SOC (
       input  CLK,
       input  RESET,
       output [4:0] LEDS,
       input  RXD,
       output TXD
   );

   reg [4:0] count = 0;
   always @(posedge CLK) begin
      count <= count + 1;
   end
   assign LEDS = count;
   assign TXD  = 1'b0; // not used for now

   endmodule
```

We call it SOC (System On Chip), which is a big name for a blinky, but that's what our blinky will be morphed into after all the steps of this tutorial. Our SOC is connected to the following signals:

- `CLK` (input) is the system clock.
- `LEDS` (output) is connected to the 5 LEDs of the board.
- `RESET` (input) is a reset button. You'll say that the IceStick has no button, but in fact ... (we'll talk about that later)
- `RXD` and `TXD` (input,output) connected to the FTDI chip that emulates a serial port through USB. We'll also talk about that later.

You can synthesize and send the bitstream to the device as follows:
```
$ BOARDS/run_xxx.sh step1.v
```
where `xxx` corresponds to your board.

The five leds will light on... but they are not blinking. Why is this so ? In fact they are blinking, but it is too fast for you to distinguish anything.

To see something, it is possible to use simulation. To use simulation, we write a new VERILOG file [bench_iverilog.v](bench_iverilog.v), with a module `bench` that encapsulates our `SOC`:
```verilog
module bench();
   reg CLK;
   wire RESET = 0;
   wire [4:0] LEDS;
   reg  RXD = 1'b0;
   wire TXD;

   SOC uut(
     .CLK(CLK),
     .RESET(RESET),
     .LEDS(LEDS),
     .RXD(RXD),
     .TXD(TXD)
   );

   reg[4:0] prev_LEDS = 0;
   initial begin
      CLK = 0;
      forever begin
         #1 CLK = ~CLK;
         if(LEDS != prev_LEDS) begin
            $display("LEDS = %b",LEDS);
         end
         prev_LEDS <= LEDS;
      end
   end
endmodule
```
The module `bench` drives all the signals of our `SOC` (called `uut` here for "unit under test"). The `forever` loop wiggles the `CLK` signal and displays the status of the LEDs whenever it changes.

Now we can start the simulation:
```
$ iverilog -DBENCH -DBOARD_FREQ=10 bench_iverilog.v step1.v
$ vvp a.out
```
... but that's a lot to remember, so I created a script for that, you'll prefer to do:
```
$ ./run.sh step1.v
```

You will see the LEDs counting. Simulation is precious, it lets you insert "print" statements (`$display`) in your VERILOG code, which is not directly possible when you run on the device !

To exit the simulation, press Ctrl-C to get the `vvp` prompt, then type:
```
finish
```

_Note: I developed the first version of femtorv completely on device, using only the LEDs to debug because I did not know how to use simulation. Don't do that, it's stupid !_

**Try this** How would you modify `step1.v` to slow it down sufficiently for one to see the LEDs blinking ?

**Try this** Can you implement a "Knight Rider"-like blinking pattern instead of counting ?

## Step 2: slower blinky

You probably got it right: the blinky can be slowed down either by counting on a larger number of bits (and wiring the most significant bits to the leds), or by inserting a "clock divider" (also called a "gearbox") that counts on a large number of bits (and driving the counter with its most significant bit). The second solution is interesting, because you do not need to modify your design, you just insert the clock divider between the `CLK` signal of the board and your design. Then, even on the device you can distinguish what happens with the LEDs.

To do that, I created a `Clockworks` module in [clockworks.v](clockworks.v), that contains the gearbox and a mechanism related to the `RESET` signal (that I'll talk about later). `Clockworks` is implemented as follows:
```verilog
module Clockworks
(
   input  CLK,   // clock pin of the board
   input  RESET, // reset pin of the board
   output clk,   // (optionally divided) clock for the design.
   output resetn // (optionally timed) negative reset for the design (more on this later)
);
   parameter SLOW;
   ...
   reg [SLOW:0] slow_CLK = 0;
   always @(posedge CLK) begin
      slow_CLK <= slow_CLK + 1;
   end
   assign clk = slow_CLK[SLOW];
   ...
endmodule
```
This divides the clock frequency by `2^(SLOW+1)`: bit `SLOW` of the counter toggles once every `2^SLOW` cycles of `CLK`, and a full period of `clk` takes two toggles.

The `Clockworks` module is then inserted between the `CLK` signal of the board and the design, using an internal `clk` signal, as follows, in [step2.v](step2.v):
```verilog
`include "clockworks.v"

module SOC (
    input  CLK,        // system clock
    input  RESET,      // reset button
    output [4:0] LEDS, // system LEDs
    input  RXD,        // UART receive
    output TXD         // UART transmit
);

   wire clk;    // internal clock
   wire resetn; // internal reset signal, goes low on reset

   // A blinker that counts on 5 bits, wired to the 5 LEDs
   reg [4:0] count = 0;
   always @(posedge clk) begin
      count <= !resetn ? 0 : count + 1;
   end

   // Clock gearbox (to let you see what happens)
   // and reset circuitry (to work around an
   // initialization problem with Ice40)
   Clockworks #(
     .SLOW(21) // Divide clock frequency by 2^22
   )CW(
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );

   assign LEDS = count;
   assign TXD  = 1'b0; // not used for now
endmodule
```
It also handles the `RESET` signal.

Now you can try it in simulation:
```
$ ./run.sh step2.v
```

As you can see, the counter is now much slower. Try it also on device:
```
$ BOARDS/run_xxx.sh step2.v
```
Yes, now we can see clearly what happens ! And what about the `RESET` button ? The IceStick has no button. In fact it has one !

![](IceStick_RESET.jpg)

Press a finger on the circled region of the image (around pin 47).

**Try this** Knight Rider mode, and `RESET` toggles direction.

If you take a look at [clockworks.v](clockworks.v), you will see it can also create a `PLL`, a component that can be used to generate *faster* clocks. For instance, the IceStick has a 12 MHz system clock, but the core that we will generate will run at 45 MHz. We will see that later.
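
Coming back to the gearbox: to get a feel for its division ratio, here is a small stand-alone simulation sketch. It is not part of the repository; the module name `divider_demo` and the tiny value `SLOW = 2` are my own choices for illustration, and it assumes a simulated clock that toggles every time unit. It rebuilds the same counter as `Clockworks` and measures how many `CLK` cycles separate two rising edges of the divided clock.

```verilog
// Hypothetical stand-alone testbench (not in the repository):
// measures the period of the divided clock, in CLK cycles.
module divider_demo();
   localparam SLOW = 2;     // small value, to keep the simulation short

   reg CLK = 0;
   always #1 CLK = ~CLK;    // simulated board clock, period = 2 time units

   // Same gearbox construction as in Clockworks
   reg [SLOW:0] slow_CLK = 0;
   always @(posedge CLK) begin
      slow_CLK <= slow_CLK + 1;
   end
   wire clk = slow_CLK[SLOW];

   time last_t = 0;         // time of the previous rising edge of clk
   reg  seen   = 0;         // set once a first rising edge was observed
   always @(posedge clk) begin
      if(seen) begin
         // one CLK cycle lasts 2 time units, hence the division by 2
         $display("clk period = %0d CLK cycles", ($time - last_t)/2);
         $finish;
      end
      seen   <= 1;
      last_t <= $time;
   end
endmodule
```

Saved for instance as `divider_demo.v` (a hypothetical file name), it can be run with `iverilog divider_demo.v` followed by `vvp a.out`. With `SLOW = 2` it reports a period of 8 `CLK` cycles, because bit `SLOW` of the counter toggles once every `2^SLOW` cycles and a full period takes two toggles.
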
## Step 3: a blinker that loads LEDs patterns from ROM

Now we have all the tools that we need, so let's see how to transform this blinker into a fully-functional RISC-V processor. This goal seems to be far far away, but the processor we will have created at step 16 is not longer than 200 lines of VERILOG ! I was amazed to discover that it is that simple to create a processor. OK, let us go there one step at a time.

We know already that a processor has a memory, and fetches instructions from there, in a sequential manner most of the time (except when there are jumps and branches). Let us start with something similar, but much simpler: a pre-programmed Christmas tinsel, that loads the LEDs pattern from a memory (see [step3.v](step3.v)). Our tinsel has a memory with the patterns:
```verilog
   reg [4:0] MEM [0:20];
   initial begin
       MEM[0]  = 5'b00000;
       MEM[1]  = 5'b00001;
       MEM[2]  = 5'b00010;
       MEM[3]  = 5'b00100;
       ...
       MEM[19] = 5'b10000;
       MEM[20] = 5'b00000;
   end
```

_Note that what's in the initial block does not generate any circuitry when synthesized, it is directly translated into the initialization data for the BRAMs of the FPGA._

We will also have a "program counter" `PC` incremented at each clock, and a mechanism to fetch `MEM` contents indexed by `PC`:
```verilog
   reg [4:0] PC = 0;
   reg [4:0] leds = 0;

   always @(posedge clk) begin
      leds <= MEM[PC];
      PC <= (!resetn || PC==20) ? 0 : (PC+1);
   end
```
_Note the test `PC==20` to make it cycle._

Now try it with simulation and on device.

**Try this** create several blinking modes, and switch between modes using `RESET`.

## The RISC-V instruction set architecture

An important source of information is of course the [RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf). There you learn that there are several flavors of the RISC-V standard. Let us start from the simplest one (RV32I, that is, 32 bits base integer instruction set).
Then we will see how to add things, one thing at a time. This is a very nice feature of RISC-V: since the instruction set is _modular_, you can start with a very small self-contained kernel, and this kernel will be compliant with the standard. This means standard tools (compiler, assembler, linker) will be able to generate code for this kernel.

Then I started reading Chapter 2 (page 13 to page 30). Seeing also the table page 130, there are in fact only 11 different instructions ! (I say for instance that an AND, an OR, an ADD ... are the same instruction, the operation is just an additional parameter). Now we just try to have an idea of the overall picture, no need to dive into the details for now. Let's take a global look at these 11 instructions:

| instruction | description                          | algo                                 |
|-------------|--------------------------------------|--------------------------------------|
| branch      | conditional jump, 6 variants         | `if(reg OP reg) PC<-PC+imm`          |
| ALU reg     | Three-registers ALU ops, 10 variants | `reg <- reg OP reg`                  |
| ALU imm     | Two-registers ALU ops, 9 variants    | `reg <- reg OP imm`                  |
| load        | Memory-to-register, 5 variants       | `reg <- mem[reg + imm]`              |
| store       | Register-to-memory, 3 variants       | `mem[reg+imm] <- reg`                |
| `LUI`       | load upper immediate                 | `reg <- (im << 12)`                  |
| `AUIPC`     | add upper immediate to PC            | `reg <- PC+(im << 12)`               |
| `JAL`       | jump and link                        | `reg <- PC+4 ; PC <- PC+imm`         |
| `JALR`      | jump and link register               | `reg <- PC+4 ; PC <- reg+imm`        |
| `FENCE`     | memory-ordering for multicores       | (not detailed here, skipped for now) |
| `SYSTEM`    | system calls, breakpoints            | (not detailed here, skipped for now) |

- The 6 branch variants are conditional jumps, that depend on a test on two registers.
- ALU operations can be of the form `register <- register OP register` or `register <- register OP immediate`
- Then we have load and store, that can operate on bytes, on 16 bit values (called half-words) or 32 bit values (called words).
  In addition, byte and half-word loads can do sign expansion. The source/target address is obtained by adding an immediate offset to the content of a register.
- The remaining instructions are more special (one may skip their description in a first read, you just need to know that they are used to implement unconditional jumps, function calls, memory ordering for multicores, system calls and breaks):
  - `LUI` (load upper immediate) is used to load the upper 20 bits of a constant. The lower bits can then be set using `ADDI` or `ORI`. At first sight it may seem weird that we need two instructions to load a 32 bit constant in a register, but in fact it is a smart choice, because all instructions are 32-bit long.
  - `AUIPC` (add upper immediate to PC) adds a constant to the current program counter and places the result in a register. It is meant to be used in combination with `JALR` to reach a 32-bit PC-relative address.
  - `JAL` (jump and link) adds an offset to the PC and stores the address of the instruction following the jump in a register. It can be used to implement function calls. `JALR` does the same thing, but adds the offset to a register.
  - `FENCE` and `SYSTEM` are used to implement memory ordering in multicore systems, and system calls/breaks respectively.

To summarize, we have branches (conditional jumps), ALU operations, load and store, and a couple of special instructions used to implement unconditional jumps and function calls. There are also two instructions for memory ordering and system calls (but we will ignore these two for now). OK, in fact only 9 instructions then, it seems doable...

At this point, I had not understood everything, so I'll start from what I think to be the simplest parts (instruction decoder, register file and ALU), then we will see how things are interconnected, how to implement jumps, branches, and all the instructions.
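
The `LUI` + `ADDI` combination mentioned above has one subtlety worth seeing on an example: `ADDI` sign-extends its 12-bit immediate, so when bit 11 of the constant is set, the upper 20 bits given to `LUI` need to be one larger to compensate. The stand-alone simulation sketch below (module and signal names are mine, and the constant is arbitrary) splits a 32-bit value into the two pieces and checks that they add up to the original value.

```verilog
// Hypothetical stand-alone sketch: splitting a 32-bit constant
// into a LUI part (upper 20 bits) and an ADDI part (signed 12 bits).
module lui_addi_demo();
   reg  [31:0] value = 32'hDEADBEEF;  // arbitrary example constant

   // lower 12 bits, sign-extended to 32 bits as ADDI does
   wire [31:0] lo = {{20{value[11]}}, value[11:0]};
   // upper 20 bits for LUI, compensated for the sign of lo
   wire [19:0] hi = (value - lo) >> 12;

   // what the two instructions would compute
   wire [31:0] after_lui  = {hi, 12'b0};      // lui  x1, hi
   wire [31:0] after_addi = after_lui + lo;   // addi x1, x1, lo

   initial begin
      #1; // let the continuous assignments settle
      $display("hi = 0x%h  lo = %0d", hi, $signed(lo));
      $display("after lui  : 0x%h", after_lui);
      $display("after addi : 0x%h", after_addi);
      if(after_addi !== value) $display("MISMATCH");
      $finish;
   end
endmodule
```

For this constant, `lo` is -273 and `hi` is `0xDEADC` (one more than the plain upper 20 bits `0xDEADB`), and `0xDEADC000 - 273` gives back `0xDEADBEEF`.
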
## Step 4: the instruction decoder

Now the idea is to have a memory with RISC-V instructions in it, load all instructions sequentially (like in our Christmas tinsel) in an `instr` register, and see how to tell the 11 instructions apart (and light a different LED depending on the recognized instruction). Each instruction is encoded in a 32-bits word, and we need to decode the different bits of this word to recognize the instruction and its parameters.

The [RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf) has all the information that we need summarized in two tables in page 130 (RV32/64G Instruction Set Listings).

Let us take a look at the big table. The first thing to notice is that the 7 LSBs tell you which instruction it is (there are 10 possibilities, we do not count `FENCE` for now).

```verilog
   reg [31:0] instr;
   ...
   wire isALUreg  =  (instr[6:0] == 7'b0110011); // rd <- rs1 OP rs2
   wire isALUimm  =  (instr[6:0] == 7'b0010011); // rd <- rs1 OP Iimm
   wire isBranch  =  (instr[6:0] == 7'b1100011); // if(rs1 OP rs2) PC<-PC+Bimm
   wire isJALR    =  (instr[6:0] == 7'b1100111); // rd <- PC+4; PC<-rs1+Iimm
   wire isJAL     =  (instr[6:0] == 7'b1101111); // rd <- PC+4; PC<-PC+Jimm
   wire isAUIPC   =  (instr[6:0] == 7'b0010111); // rd <- PC + Uimm
   wire isLUI     =  (instr[6:0] == 7'b0110111); // rd <- Uimm
   wire isLoad    =  (instr[6:0] == 7'b0000011); // rd <- mem[rs1+Iimm]
   wire isStore   =  (instr[6:0] == 7'b0100011); // mem[rs1+Simm] <- rs2
   wire isSYSTEM  =  (instr[6:0] == 7'b1110011); // special
```

Besides the instruction type, we also need to decode the arguments of the instruction. The table on the top distinguishes 6 types of instructions (`R-type`,`I-type`,`S-type`,`B-type`,`U-type`,`J-type`), depending on the arguments of the instruction and how they are encoded within the 32 bits of the instruction word.

`R-type` instructions take two source registers `rs1` and `rs2`, apply an operation on them and store the result in a third destination register `rd` (`ADD`, `SUB`, `SLL`, `SLT`, `SLTU`, `XOR`, `SRL`, `SRA`, `OR`, `AND`).

Since RISC-V has 32 registers, each of `rs1`,`rs2` and `rd` uses 5 bits of the instruction word. Interestingly, these are the same bits for all instruction formats. Hence, "decoding" `rs1`,`rs2` and `rd` is just a matter of drawing some wires from the instruction word:
```verilog
   wire [4:0] rs1Id = instr[19:15];
   wire [4:0] rs2Id = instr[24:20];
   wire [4:0] rdId  = instr[11:7];
```

Then, one needs to tell the 10 R-type instructions apart. It is done mostly with the `funct3` field, a 3-bits code. With a 3-bits code, one can only encode 8 different instructions, hence there is also a `funct7` field (7 MSBs of instruction word). Bit 30 of the instruction word encodes `ADD`/`SUB` and `SRA`/`SRL` (arithmetic right shift with sign expansion/logical right shift). The instruction decoder has wires for `funct3` and `funct7`:
```verilog
   wire [2:0] funct3 = instr[14:12];
   wire [6:0] funct7 = instr[31:25];
```

`I-type` instructions take one register `rs1` and an immediate value `Iimm`, apply an operation on them and store the result in the destination register `rd` (`ADDI`, `SLTI`, `SLTIU`, `XORI`, `ORI`, `ANDI`, `SLLI`, `SRLI`, `SRAI`).

_Wait a minute:_ there are 10 R-Type instructions but only 9 I-Type instructions, why is this so ? If you look carefully, you will see that there is no `SUBI`, but one can instead use `ADDI` with a negative immediate value. This is a general rule in RISC-V: if an existing functionality can be used, do not create a new functionality.

As for R-type instructions, the instruction can be distinguished using `funct3` and `funct7` (and in `funct7`, only the bit 30 of the instruction word is used, to distinguish `SRAI`/`SRLI` arithmetic and logical right shifts).
The immediate value is encoded in the 12 MSBs of the instruction word, hence we will draw additional wires to get it:

```verilog
   wire [31:0] Iimm={{21{instr[31]}}, instr[30:20]};
```
As can be seen, bit 31 of the instruction word is repeated 21 times, this is "sign expansion" (converts a 12-bits signed quantity into a 32-bits one).

There are four other instruction formats: `S-type` (for Store), `B-type` (for Branch), `U-type` (for Upper immediates that are left-shifted by 12), and `J-type` (for Jumps). Each instruction format has a different way of encoding an immediate value in the instruction word.

To understand what it means, let's get back to Chapter 2, page 16. The different instruction types correspond to the way _immediate values_ are encoded in them.

| Instr. type | Description                                    | Immediate value encoding                             |
|-------------|------------------------------------------------|------------------------------------------------------|
| `R-type`    | register-register ALU ops. [more on this here](https://www.youtube.com/watch?v=pVWtI0426mU) | None |
| `I-type`    | register-immediate integer ALU ops and `JALR`. | 12 bits, sign expansion                              |
| `S-type`    | store                                          | 12 bits, sign expansion                              |
| `B-type`    | branch                                         | 12 bits, sign expansion, upper `[31:1]` (bit 0 is 0) |
| `U-type`    | `LUI`,`AUIPC`                                  | 20 bits, upper `[31:12]` (bits `[11:0]` are 0)       |
| `J-type`    | `JAL`                                          | 20 bits, sign expansion, upper `[31:1]` (bit 0 is 0) |

Note that `I-type` and `S-type` encode the same type of values (but they are taken from different parts of `instr`). `B-type` and `J-type` are similar to each other in the same way (sign-extended offsets whose bit 0 is always 0), with different widths.
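
As a quick sanity check of the `I-type` sign expansion described above, here is a small stand-alone simulation sketch (the module name and the example instruction are my own choices). It decodes the immediate of `addi x1, x1, -3` with the same `Iimm` expression and prints it as a signed number.

```verilog
// Hypothetical stand-alone sketch: decoding a negative I-type immediate.
module iimm_demo();
   // addi x1, x1, -3
   //                  imm         rs1  add  rd   ALUIMM
   reg  [31:0] instr = 32'b111111111101_00001_000_00001_0010011;

   // same expression as in the decoder: bit 31 is copied 21 times
   wire [31:0] Iimm = {{21{instr[31]}}, instr[30:20]};

   initial begin
      #1; // let the continuous assignment settle
      $display("Iimm = %b", Iimm);
      $display("Iimm = %0d (signed)", $signed(Iimm));
      $finish;
   end
endmodule
```

The 12-bit pattern `111111111101` becomes the 32-bit value `0xFFFFFFFD`, that is, -3: the upper 21 bits are all copies of `instr[31]`.
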

One can decode the different types of immediates as follows:
```verilog
   wire [31:0] Uimm={    instr[31],   instr[30:12], {12{1'b0}}};
   wire [31:0] Iimm={{21{instr[31]}}, instr[30:20]};
   wire [31:0] Simm={{21{instr[31]}}, instr[30:25],instr[11:7]};
   wire [31:0] Bimm={{20{instr[31]}}, instr[7],instr[30:25],instr[11:8],1'b0};
   wire [31:0] Jimm={{12{instr[31]}}, instr[19:12],instr[20],instr[30:21],1'b0};
```
Note that `Iimm`, `Simm`, `Bimm` and `Jimm` do sign expansion (by copying bit 31 the required number of times to fill the MSBs).

And that's all for our instruction decoder ! To summarize, the instruction decoder gets the following information from the instruction word:
- signals isXXX that tell which of the 11 possible RISC-V instructions it is
- source and destination registers `rs1`,`rs2` and `rd`
- function codes `funct3` and `funct7`
- the five formats for immediate values (with sign expansion for `Iimm`, `Simm`, `Bimm` and `Jimm`).

Let us now initialize the memory with a few RISC-V instructions and see whether we can recognize them by lighting a different LED depending on the instruction ([step4.v](step4.v)). To do that, we use the big table in page 130 of the [RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf). It is a bit painful (we will see easier ways later !). Using the `_` character to separate fields of a binary constant is especially useful in this situation.
```verilog
   initial begin
      // add x1, x0, x0
      //                    rs2   rs1  add  rd   ALUREG
      MEM[0] = 32'b0000000_00000_00000_000_00001_0110011;
      // addi x1, x1, 1
      //             imm         rs1  add  rd   ALUIMM
      MEM[1] = 32'b000000000001_00001_000_00001_0010011;
      ...
      // lw x2,0(x1)
      //             imm         rs1   w   rd   LOAD
      MEM[5] = 32'b000000000000_00001_010_00010_0000011;
      // sw x2,0(x1)   (rs2 = x2 is the stored value, rs1 = x1 the base address)
      //             imm     rs2   rs1   w   imm  STORE
      MEM[6] = 32'b0000000_00010_00001_010_00000_0100011;
      // ebreak
      //                                        SYSTEM
      MEM[7] = 32'b000000000001_00000_000_00000_1110011;
   end
```

Then we can fetch and recognize the instructions as follows:
```verilog
   always @(posedge clk) begin
      if(!resetn) begin
         PC <= 0;
      end else if(!isSYSTEM) begin
         instr <= MEM[PC];
         PC <= PC+1;
      end
   end
   assign LEDS = isSYSTEM ? 31 : {PC[0],isALUreg,isALUimm,isStore,isLoad};
```
(first led is wired to `PC[0]` so that we will see it blinking even if there is the same instruction several times).

As you can see, the program counter is only incremented if the instruction is not `SYSTEM`. For now, the only `SYSTEM` instruction that we support is `EBREAK`, that halts execution.

In simulation mode, we can in addition display the name of the recognized instruction and the fields:
```verilog
`ifdef BENCH
   always @(posedge clk) begin
      $display("PC=%0d",PC);
      case (1'b1)
        isALUreg: $display("ALUreg rd=%d rs1=%d rs2=%d funct3=%b",rdId, rs1Id, rs2Id, funct3);
        isALUimm: $display("ALUimm rd=%d rs1=%d imm=%0d funct3=%b",rdId, rs1Id, Iimm, funct3);
        isBranch: $display("BRANCH");
        isJAL:    $display("JAL");
        isJALR:   $display("JALR");
        isAUIPC:  $display("AUIPC");
        isLUI:    $display("LUI");
        isLoad:   $display("LOAD");
        isStore:  $display("STORE");
        isSYSTEM: $display("SYSTEM");
      endcase
   end
`endif
```

**Try this** run `step4.v` in simulation and on the device. Try initializing the memory with different RISC-V instructions and test whether the decoder recognizes them.

## Sidebar: the elegance of RISC-V

This paragraph may be skipped. It just contains my own impressions and reflections on the RISC-V instruction set, inspired by the comments and Q&A in italics in the [RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf).

At this point, I realized what an _instruction set architecture_ means: it is for sure a specification of _what bit pattern does what_ (Instruction Set) and it is also at the same time driven by how this will be translated into wires (Architecture). An ISA is not _abstract_: it is _independent_ of any particular implementation, but it is strongly designed with implementation in mind ! While the pipeline, branch prediction unit, multiple execution units, caches may differ in different implementations, the instruction decoder is probably very similar in all implementations.

There were things that seemed really weird to me in the first place: all these immediate format variants, the fact that immediate values are scrambled in different bits of `instr`, the `zero` register, and the weird instructions `LUI`,`AUIPC`,`JAL`,`JALR`. When writing the instruction decoder, you better understand the reasons. The ISA is really smart, and is the result of a long evolution (there were RISC-I, RISC-II, ... before). It seems to me the result of a _distillation_. Now, in 2020, many things were tested in terms of ISA, and this one seems to have benefited from all the previous attempts, taking the good choices and avoiding the suboptimal ones.

What is really nice in the ISA is:
- instruction size is fixed. Makes things really easier. _(there are extensions with varying instruction length, but at least the core instruction set is simple)_;
- `rs1`,`rs2`,`rd` are always encoded by the same bits of `instr`;
- the immediate formats that need to do sign expansion do it from the same bit (`instr[31]`);
- the weird instructions `LUI`,`AUIPC`,`JAL`,`JALR` can be combined to implement higher-level tasks (load 32-bit constant in register, jump to arbitrary address, function calls). Their existence is justified by the fact it makes the design easier. Then assembly programmer's life is made easier by _pseudo-instructions_ `CALL`, `RET`, ...
  See [risc-v assembly manual](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md), the two tables at the end of the page. Same thing for tests/branch instructions obtained by swapping parameters (e.g. `a < b <=> b > a` etc...), there are pseudo-instructions that do the job for you.

Put differently, to appreciate the elegance of the RISC-V ISA, imagine that your mission is to _invent it_. That is, invent both the set of instructions and the way they are encoded as bit patterns. The constraints are:
- fixed instruction length (32 bits)
- as simple as possible: the ultimate sophistication is simplicity [Leonardo da Vinci] !!
- source and destination registers always encoded at the same position
- whenever there is sign-extension, it should be done from the same bit
- it should be simple to load an arbitrary 32-bits immediate value in a register (but may take several instructions)
- it should be simple to jump to arbitrary memory locations (but may take several instructions)
- it should be simple to implement function calls (but may take several instructions)

Then you understand why there are many different immediate formats. For instance, consider `JAL`, that does not have a source register, as compared to `JALR` that has one. Both take an immediate value, but `JAL` has 8 more bits available to store it (20 instead of 12), since it needs to encode neither the source register (5 bits) nor `funct3` (3 bits). The slightest available bit is used to extend the dynamic range of the immediates. This explains both the multiple immediate formats and the fact that they are assembled from multiple pieces of `instr`, slaloming between the three fixed 5-bits register encodings, that are there or not depending on the cases.

Now the rationale behind the weird instructions `LUI`,`AUIPC`,`JAL` and `JALR` is to give a set of functions that can be combined to load arbitrary 32-bit values in register, or to jump to arbitrary locations in memory, or to implement the function call protocol as simply as possible.
Considering the constraints, the choices that were made (that seemed weird to me in the first place) make perfect sense. In addition, with these choices, the instruction decoder is pretty simple and has a low logical depth. Besides the 7-bits instruction decoder, it mostly consists of a set of wires drawn from the bits of `instr`, and duplication of the sign-extended bit 31 to form the immediate values.

Before moving forward, I'd like to say a word about the `zero` register. I think it is really a smart move. With it, you do not need a `MOV rd rs` instruction (just `ADD rd rs zero`), you do not need a `NOP` instruction (`ADD zero zero zero`), and all the branch variants can compare with `zero` ! I think that `zero` is a great invention, not as great as `0`, but it really makes the instruction set more compact.

## Step 5: The register bank and the state machine

The register bank is implemented as follows:
```verilog
   reg [31:0] RegisterBank [0:31];
```

Let us take a closer look at what we need to do to execute an instruction. Consider for instance a stream of R-type instructions. For each instruction, we need to do the following four things:

- fetch the instruction: `instr <= MEM[PC]`
- fetch the values of `rs1` and `rs2`: `rs1 <= RegisterBank[rs1Id]; rs2 <= RegisterBank[rs2Id]` where `rs1` and `rs2` are two registers. We need to do that because `RegisterBank` will be synthesized as a block of BRAM, and one needs one cycle to access the content of BRAM.
- compute `rs1` `OP` `rs2` (where `OP` depends on `funct3` and `funct7`)
- store the result in `rd`: `RegisterBank[rdId] <= writeBackData`. This can be done during the same cycle as the previous step if `OP` is computed by a combinatorial circuit.

The first three operations are implemented by a state machine, as follows (see [step5.v](step5.v)):
```verilog
   localparam FETCH_INSTR = 0;
   localparam FETCH_REGS  = 1;
   localparam EXECUTE     = 2;
   reg [1:0] state = FETCH_INSTR;
   always @(posedge clk) begin
      case(state)
         FETCH_INSTR: begin
            instr <= MEM[PC];
            state <= FETCH_REGS;
         end
         FETCH_REGS: begin
            rs1 <= RegisterBank[rs1Id];
            rs2 <= RegisterBank[rs2Id];
            state <= EXECUTE;
         end
         EXECUTE: begin
            PC <= PC + 1;
            state <= FETCH_INSTR;
         end
      endcase
   end
```

The fourth one (register write-back) is implemented in this block:
```verilog
   wire [31:0] writeBackData = ... ;
   wire writeBackEn = ...;
   always @(posedge clk) begin
      if(writeBackEn && rdId != 0) begin
         RegisterBank[rdId] <= writeBackData;
      end
   end
```
Remember that writing to register 0 has no effect (hence the test `rdId != 0`). The signal `writeBackEn` is asserted whenever `writeBackData` should be written to register `rdId`. The data to be written back (`writeBackData`) will be obtained from the ALU, as explained in the next step.

**Try this**: run [step5.v](step5.v) in simulation and on the device. You will see your wannabe CPU's state machine dancing a waltz on the LEDs (that display the current state).

## Step 6: the ALU

Now we can fetch instructions from memory, decode them and read register values, but our (wannabe) CPU is still unable to do anything. Let us see how to do actual computations on register values.

_So, are you going to create an `ALU` module ? And by the way, why did you not create a `Decoder` module, and a `RegisterBank` module ?_

My very first design used multiple modules and multiple files, for a total of 1000 lines of code or so, then Matthias Koch wrote a monolithic version, that fits in 200 lines of code. Not only is it more compact, but it is also much easier to understand when you have everything in one place. **Rule of thumb:** if you have more boxes and wires between the boxes than circuitry in the boxes, then you have too many boxes !

_But wait a minute, modular design is good, no ?_

Modular design is neither good nor bad, it is useful whenever it makes things simpler. It is not the case in the present situation. There is no absolute answer though, it is a matter of taste and style ! In this tutorial, we use a (mostly) monolithic design.

Now we want to implement two types of instructions:
- Rtype: `rd` <- `rs1` `OP` `rs2` (recognized by `isALUreg`)
- Itype: `rd` <- `rs1` `OP` `Iimm` (recognized by `isALUimm`)

The ALU takes two inputs `aluIn1` and `aluIn2`, computes `aluIn1` `OP` `aluIn2` and stores it in `aluOut`:
```verilog
   wire [31:0] aluIn1 = rs1;
   wire [31:0] aluIn2 = isALUreg ? rs2 : Iimm;
   reg  [31:0] aluOut;
```

Depending on the instruction type, `aluIn2` is either the value in the second source register `rs2`, or an immediate in the `Itype` format (`Iimm`). The operation `OP` depends mostly on `funct3` (and also on `funct7`). Keep a copy of the [RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf) open at page 130, on your knees or in another window:

| funct3 | operation                                     |
|--------|-----------------------------------------------|
| 3'b000 | `ADD` or `SUB`                                |
| 3'b001 | left shift                                    |
| 3'b010 | signed comparison (<)                         |
| 3'b011 | unsigned comparison (<)                       |
| 3'b100 | `XOR`                                         |
| 3'b101 | logical right shift or arithmetic right shift |
| 3'b110 | `OR`                                          |
| 3'b111 | `AND`                                         |

- for `ADD`/`SUB`, if it is an `ALUreg` operation (Rtype), then one makes the difference between `ADD` and `SUB` by testing bit 5 of `funct7` (1 for `SUB`). If it is an `ALUimm` operation (Itype), then it can be only `ADD`. In this context, one just needs to test bit 5 of `instr` to distinguish between `ALUreg` (if it is 1) and `ALUimm` (if it is 0).
- for logical or arithmetic right shift, one makes the difference also by testing bit 5 of `funct7`, 1 for arithmetic shift (with sign extension) and 0 for logical shift.
- the shift amount is either the 5 LSBs of `rs2` for `ALUreg` instructions or `instr[24:20]` (the same bits as `rs2Id`) for `ALUimm` instructions.

Putting everything together, one gets the following VERILOG code for the ALU:
```verilog
   reg [31:0] aluOut;
   wire [4:0] shamt = isALUreg ? rs2[4:0] : instr[24:20]; // shift amount
   always @(*) begin
      case(funct3)
        3'b000: aluOut = (funct7[5] & instr[5]) ? (aluIn1-aluIn2) : (aluIn1+aluIn2);
        3'b001: aluOut = aluIn1 << shamt;
        3'b010: aluOut = ($signed(aluIn1) < $signed(aluIn2));
        3'b011: aluOut = (aluIn1 < aluIn2);
        3'b100: aluOut = (aluIn1 ^ aluIn2);
        // $signed on both branches keeps the ?: expression signed,
        // so that >>> remains an arithmetic shift (>> is always logical).
        3'b101: aluOut = funct7[5]? ($signed(aluIn1) >>> shamt) :
                                    ($signed(aluIn1) >> shamt);
        3'b110: aluOut = (aluIn1 | aluIn2);
        3'b111: aluOut = (aluIn1 & aluIn2);
      endcase
   end
```

_Note:_ although it is declared as a `reg`, `aluOut` will be a combinatorial function (no flipflop generated), because its value is determined in a combinatorial block (`always @(*)`), and all the configurations are enumerated in the `case` statement.

Register write-back is configured as follows:
```verilog
   assign writeBackData = aluOut;
   assign writeBackEn = (state == EXECUTE && (isALUreg || isALUimm));
```

**Try this** run [step6.v](step6.v) in simulation and on the device. In simulation it will display the written value and the written register for each register write-back operation. On the device it will show the 5 LSBs of `x1` on the LEDs. Then you can try changing the program, and observe the effect on register values.

**You are here !** This is the list of instructions you have to implement, your wannabe RISC-V core currently supports 20 of them. Next steps: jumps, then branches, then... the rest. Before then, as you have probably noticed, translating RISC-V programs into binary (that is, assembling manually) is extremely painful. The next section gives a much easier solution.

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [ ] 2 | [ ] 6  | [ ] | [ ]   | [ ] 5 | [ ] 3 | [*] 1  |

## Step 7: using the VERILOG assembler

To avoid having to manually translate RISC-V assembly into binary, one can use the GNU assembler, generate a binary file, translate it into hexadecimal and use the VERILOG system task `$readmemh()` to initialize memory with the content of that file. We will see later how to do that.

But in our case, it would be very convenient to be able to write small assembly programs directly in the same VERILOG file as our design. In fact, it is possible to do so, by implementing a RISC-V assembler directly in VERILOG (using tasks and functions), as done in [riscv_assembly.v](riscv_assembly.v).

In [step7.v](step7.v), memory is initialized with the same assembly program as in [step6.v](step6.v). It now looks like this. Much easier to read, no ?
```verilog
   `include "riscv_assembly.v"
   initial begin
      ADD(x0,x0,x0);
      ADD(x1,x0,x0);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADD(x2,x1,x0);
      ADD(x3,x1,x2);
      SRLI(x3,x3,3);
      SLLI(x3,x3,31);
      SRAI(x3,x3,5);
      SRLI(x1,x3,26);
      EBREAK();
   end
```

_Note:_ `riscv_assembly.v` needs to be included from inside the module that uses assembly.

In this step, we make another modification: in the previous steps, `PC` was the index of the current instruction. For what follows, we want it to be the _address_ of the current instruction. Since each instruction is 32 bits long, it means that:
- to increment `PC`, we do `PC <= PC + 4` (instead of `PC <= PC + 1` as before)
- to fetch the current instruction, we do `instr <= MEM[PC[31:2]];` (we ignore the two LSBs of `PC`).

## Step 8: jumps

There are two jump instructions, `JAL` (jump and link), and `JALR` (jump and link register). By "and link", one means that the address of the next instruction (`PC+4`) can be written to a register.
Hence `JAL` and `JALR` can be used to implement not only jumps, but also function calls. Here is what the two instructions are supposed to do:

| instruction     | effect                  |
|-----------------|-------------------------|
| JAL rd,imm      | rd<-PC+4; PC<-PC+Jimm   |
| JALR rd,rs1,imm | rd<-PC+4; PC<-rs1+Iimm  |

To implement these two instructions, we need to make the following changes to our core. The first thing is register write-back: the value can now be `PC+4` instead of `aluOut` for jump instructions:
```verilog
   assign writeBackData = (isJAL || isJALR) ? (PC + 4) : aluOut;
   assign writeBackEn = (state == EXECUTE &&
                         (isALUreg || isALUimm || isJAL || isJALR)
                        );
```

We also need to declare a `nextPC` value, that implements the three possibilities:
```verilog
   wire [31:0] nextPC = isJAL  ? PC+Jimm :
                        isJALR ? rs1+Iimm :
                        PC+4;
```

Then, in the state machine, the line `PC <= PC + 4;` is replaced with `PC <= nextPC;` and that's all !

We can now implement a simple (infinite) loop to test our new jump instruction:
```verilog
   `include "riscv_assembly.v"
   integer L0_=4;
   initial begin
      ADD(x1,x0,x0);
   Label(L0_);
      ADDI(x1,x1,1);
      JAL(x0,LabelRef(L0_));
      EBREAK();
      endASM();
   end
```

The integer `L0_` is a label. Unlike with a real assembler, we need to specify the value of `L0_` by hand. Here it is easy, `L0_` is right after the first instruction, hence it corresponds to the beginning of the RAM (0) plus one 32-bit word, that is, 4. For longer programs with many labels, you can leave the labels uninitialized (`integer L0_;`), then the first time you run the program, it will compute and display the values to be used for the labels. It is not super-convenient, but still much better than assembling by hand / determining the labels by hand.

The `LabelRef()` function computes the label's offset relative to the current program counter.
In addition, in simulation mode, it displays the current address (to be used to initialize the label), and if the label was already initialized (like here with `L0_=4`) it checks that the label corresponds to the current address generated by the assembler. If it is not the case, the `endASM()` statement displays an error message and exits.

_Note 1_: I systematically insert an `EBREAK()` instruction at the end of the program. Here it would not be necessary (we have an infinite loop), but if I change my mind and exit the loop, then `EBREAK()` is already there.

_Note 2_: the `endASM();` statement checks the validity of all the labels and exits simulation whenever an invalid label is detected. If you use the RISC-V VERILOG assembler, systematically run your design in simulation before synthesizing (because this verification cannot be done at synthesis time).

**Try this** Run the design [step8.v](step8.v) in simulation and on the device. Yes, after 8 steps, what we have is just another stupid blinky ! But this time, this blinky is executing a real RISC-V program ! It is not a complete RISC-V core yet, but it starts to have a strong RISC-V flavor. Be patient, our core will soon be able to run RISC-V programs that are more interesting than a blinky.

**You are here !** Still some work to do, but we are making progress.

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [ ] 6  | [ ] | [ ]   | [ ] 5 | [ ] 3 | [*] 1  |

**Try this** add a couple of instructions before the loop, run in simulation, fix the label as indicated by the simulator, re-run in simulation, run on device.

## Step 9: Branches

Branches are like jumps, except that they compare two registers, and update `PC` based on the result of the comparison. Another difference is that they are more limited in the address range they can reach from `PC` (the offset is encoded in a 12-bit immediate, versus 20 bits for `JAL`).
There are 6 different branch instructions:

| instruction      | effect                                             |
|------------------|----------------------------------------------------|
| BEQ rs1,rs2,imm  | if(rs1 == rs2) PC <- PC+Bimm                       |
| BNE rs1,rs2,imm  | if(rs1 != rs2) PC <- PC+Bimm                       |
| BLT rs1,rs2,imm  | if(rs1 <  rs2) PC <- PC+Bimm (signed comparison)   |
| BGE rs1,rs2,imm  | if(rs1 >= rs2) PC <- PC+Bimm (signed comparison)   |
| BLTU rs1,rs2,imm | if(rs1 <  rs2) PC <- PC+Bimm (unsigned comparison) |
| BGEU rs1,rs2,imm | if(rs1 >= rs2) PC <- PC+Bimm (unsigned comparison) |

_Wait a minute:_ there is `BLT`, but where is `BGT` ? Always the same principle in a RISC-V processor: if something can be done with a functionality that is already there, do not add a new functionality ! In this case, `BGT rs1,rs2,imm` is equivalent to `BLT rs2,rs1,imm` (just swap the first two operands). If you use `BGT` in a RISC-V assembly program, it will work (and the assembler replaces it with `BLT` with swapped operands). `BGT` is called a "pseudo-instruction". There are many pseudo-instructions to make the RISC-V assembly programmer's life easier (more on this later).

Back to our branch instructions, we will need to add in the ALU some wires to compute the result of the test, as follows:
```verilog
   reg takeBranch;
   always @(*) begin
      case(funct3)
        3'b000: takeBranch = (rs1 == rs2);
        3'b001: takeBranch = (rs1 != rs2);
        3'b100: takeBranch = ($signed(rs1) <  $signed(rs2));
        3'b101: takeBranch = ($signed(rs1) >= $signed(rs2));
        3'b110: takeBranch = (rs1 <  rs2);
        3'b111: takeBranch = (rs1 >= rs2);
        default: takeBranch = 1'b0;
      endcase
   end
```

_Note 1_ it is possible to create a much more compact ALU, that uses a much smaller number of LUTs when synthesized, we will see that later (for now, our goal is to have a RISC-V processor that works, we will optimize it later).

_Note 2_ Among the 8 possibilities given by `funct3`, only 6 of them are used by the branch instructions.
It is necessary to have a `default:` statement in the `case`, else the synthesizer would not be able to keep `takeBranch` as purely combinatorial (and would generate a latch, which we do not want).

Now the only thing that remains to do for implementing branches is to add a case for `nextPC`, as follows:
```verilog
   wire [31:0] nextPC = (isBranch && takeBranch) ? PC+Bimm :
                        isJAL  ? PC+Jimm :
                        isJALR ? rs1+Iimm :
                        PC+4;
```

We are now ready to test a simple loop, that counts from 0 to 31, displays each iteration on the LEDs (remember, they are wired to `x1`) and stops:
```verilog
   `include "riscv_assembly.v"
   integer L0_ = 8;
   initial begin
      ADD(x1,x0,x0);
      ADDI(x2,x0,32);
   Label(L0_);
      ADDI(x1,x1,1);
      BNE(x1, x2, LabelRef(L0_));
      EBREAK();
      endASM();
   end
```

**Try this** run [step9.v](step9.v) in simulation and on device. Try modifying the program, create a "Knight Rider" blinky with an outer loop and two inner loops (one left to right and one right to left).

**You are here !** Wow, we have implemented 28 instructions out of 38 ! Let us continue...

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [*] 6  | [ ] | [ ]   | [ ] 5 | [ ] 3 | [*] 1  |

## Step 10: LUI and AUIPC

We still have these two weird instructions to implement. What do they do ? It is rather simple:

| instruction   | effect          |
|---------------|-----------------|
| LUI rd, imm   | rd <- Uimm      |
| AUIPC rd, imm | rd <- PC + Uimm |

And if you look at the `Uimm` format, it reads its MSBs (`imm[31:12]`) from the immediate encoded in the instruction. The 12 LSBs are set to zero. These two instructions are super useful: the immediate formats supported by all the other instructions can only modify the LSBs. Combined with these two instructions, one can load an arbitrary value in a register (but this can require up to two instructions).
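
To make the "up to two instructions" remark concrete, here is a small simulation-only sketch that splits a 32-bit constant into the part written by `LUI` and the 12-bit immediate added by `ADDI`. The module `li_split_tb`, the task `show` and the signal names are made up for this illustration, they are not part of the tutorial's sources. The only subtlety is that `ADDI` sign-extends its immediate, so the upper part has to be rounded up when bit 11 of the constant is set.

```verilog
// Simulation-only sketch; the module, task and signal names are illustrative.
module li_split_tb;
   reg  [31:0] value;

   // ADDI sign-extends its 12-bit immediate. When bit 11 of the constant
   // is set, the low part is negative, so the upper part is rounded up by
   // adding 0x800 before clearing the 12 LSBs.
   wire [31:0] upper   = (value + 32'h00000800) & 32'hFFFFF000; // written by LUI
   wire [31:0] lower   = value - upper;  // ADDI immediate, in [-2048, 2047]
   wire [31:0] rebuilt = upper + lower;  // register content after LUI + ADDI

   task show;
      input [31:0] v;
      begin
         value = v;
         #1; // let the combinatorial wires settle
         $display("%h = LUI %h, then ADDI %0d (rebuilt: %h)",
                  value, upper, $signed(lower), rebuilt);
      end
   endtask

   initial begin
      show(32'h12345678);
      show(32'hDEADBEEF); // bit 11 is set: the ADDI immediate is negative
      show(32'h000007FF); // upper part is 0: ADDI alone is enough
      $finish;
   end
endmodule
```

When the upper part is zero, a single `ADDI rd,x0,imm` suffices, and when the lower part is zero a single `LUI` does the job.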

Implementing these two instructions just requires changing `writeBackEn` and `writeBackData` as follows:
```verilog
   assign writeBackData = (isJAL || isJALR) ? (PC + 4) :
                          (isLUI) ? Uimm :
                          (isAUIPC) ? (PC + Uimm) :
                          aluOut;

   assign writeBackEn = (state == EXECUTE &&
                         (isALUreg ||
                          isALUimm ||
                          isJAL    ||
                          isJALR   ||
                          isLUI    ||
                          isAUIPC)
                        );
```

**You are here !** Seems that we are nearly there ! 8 instructions to go...

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [*] 6  | [*] | [*]   | [ ] 5 | [ ] 3 | [*] 1  |

**Try this** run [step10.v](step10.v) in simulation and on the device.

_Argh !!_ On my IceStick, it does not fit (it requires 1283 LUTs and the IceStick only has 1280). What can we do ? Remember, we took absolutely no care about resource consumption, just trying to write a design that works. In fact, there is _a lot_ of room for improvement in our design, we will see that later, but before then, let's organize our SOC a bit better (then we will shrink the processor).

## Step 11: Memory in a separate module

In our previous designs, we got everything in our `SOC` module (memory and processor). In this step, we will see how to separate them.

First, the `Memory` module:
```verilog
module Memory (
   input             clk,
   input      [31:0] mem_addr,  // address to be read
   output reg [31:0] mem_rdata, // data read from memory
   input             mem_rstrb  // goes high when processor wants to read
);
   reg [31:0] MEM [0:255];

`include "riscv_assembly.v"
   integer L0_=8;
   initial begin
      ADD(x1,x0,x0);
      ADDI(x2,x0,32);
   Label(L0_);
      ADDI(x1,x1,1);
      BNE(x1, x2, LabelRef(L0_));
      EBREAK();
      endASM();
   end

   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[mem_addr[31:2]];
      end
   end
endmodule
```

In its interface, there is a `clk` signal connected to the clock. Whenever the processor wants to read in memory, it positions the address to be read on `mem_addr`, and sets `mem_rstrb` to 1.
Then the `Memory` module returns the data to be read on `mem_rdata`.

Symmetrically, the `Processor` module has a `mem_addr` signal (as `output` this time), a `mem_rdata` signal (as input) and a `mem_rstrb` signal (as output):
```verilog
module Processor (
    input             clk,
    input             resetn,
    output     [31:0] mem_addr,
    input      [31:0] mem_rdata,
    output            mem_rstrb,
    output reg [31:0] x1
);
...
endmodule
```
(in addition, we have an `x1` signal that contains the contents of register `x1`, that can be used for visual debugging. We will plug it to the LEDs).

The state machine has one additional state:
```verilog
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;

   case(state)
     FETCH_INSTR: begin
        state <= WAIT_INSTR;
     end
     WAIT_INSTR: begin
        instr <= mem_rdata;
        state <= FETCH_REGS;
     end
     FETCH_REGS: begin
        rs1 <= RegisterBank[rs1Id];
        rs2 <= RegisterBank[rs2Id];
        state <= EXECUTE;
     end
     EXECUTE: begin
        if(!isSYSTEM) begin
           PC <= nextPC;
        end
        state <= FETCH_INSTR;
     end
   endcase
```

_Note_ we will see later how to simplify it and get back to three states.

Now, `mem_addr` and `mem_rstrb` can be wired as follows:
```verilog
   assign mem_addr = PC;
   assign mem_rstrb = (state == FETCH_INSTR);
```

And finally, everything is installed and connected in the `SOC`:
```verilog
module SOC (
    input        CLK,   // system clock
    input        RESET, // reset button
    output [4:0] LEDS,  // system LEDs
    input        RXD,   // UART receive
    output       TXD    // UART transmit
);
   wire clk;
   wire resetn;

   // Declared before use, so that the instances below
   // do not create implicit 1-bit nets.
   wire [31:0] mem_addr;
   wire [31:0] mem_rdata;
   wire        mem_rstrb;
   wire [31:0] x1;

   Memory RAM(
      .clk(clk),
      .mem_addr(mem_addr),
      .mem_rdata(mem_rdata),
      .mem_rstrb(mem_rstrb)
   );

   Processor CPU(
      .clk(clk),
      .resetn(resetn),
      .mem_addr(mem_addr),
      .mem_rdata(mem_rdata),
      .mem_rstrb(mem_rstrb),
      .x1(x1)
   );
   assign LEDS = x1[4:0];

   // Gearbox and reset circuitry.
   Clockworks #(
     .SLOW(19) // Divide clock frequency by 2^19
   ) CW (
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );

   assign TXD = 1'b0; // not used for now
endmodule
```

Now you can run [step11.v](step11.v) in the simulator. As expected, it does the same thing as in the previous step (counts on the LEDs from 0 to 31 and stops). What about running it on the device ? Wow, even worse, 1341 LUTs (and we only got 1280 of them on the IceStick). So let us shrink our code to make it fit !

## Step 12: Size optimization: the Incredible Shrinking Core.

_Tribute to "the Incredible Shrinking Man" classic movie_

There are many things we can do for shrinking this core. Let us first take a look at the ALU. It can compute addition, subtraction, and comparisons. Can't we reuse the result of subtraction for comparisons ? Sure we can, but to do that we need to compute a 33-bit subtraction, and test the sign bit. Matthias Koch (@Mecrisp) explained this trick to me, it is also used in swapforth/J1 (another small RISC core that works on the IceStick). The 33-bit subtraction is written as follows:
```verilog
   wire [32:0] aluMinus = {1'b0,aluIn1} - {1'b0,aluIn2};
```

If you want to know what `A-B` does in Verilog, it corresponds to `A+~B+1` (negate all the bits of B before adding, and add 1), it is how two's complement subtraction works. For instance, take `4'b0000 - 4'b0001`, the result is `-1`, encoded as `4'b1111`. It is computed as follows by the formula: `4'b0000 + ~4'b0001 + 1` = `4'b0000 + 4'b1110 + 1` = `4'b1111`. So we will keep the following expression (we could have kept the simpler form above, but it is interesting to be aware of what happens behind the scenes):
```verilog
   wire [32:0] aluMinus = {1'b1, ~aluIn2} + {1'b0,aluIn1} + 33'b1;
```

Then we can create the wires for the three tests (this saves three 32-bit adders):
```verilog
   wire EQ  = (aluMinus[31:0] == 0);
   wire LTU = aluMinus[32];
   wire LT  = (aluIn1[31] ^ aluIn2[31]) ?
                 aluIn1[31] : aluMinus[32];
```

- The first one, `EQ`, goes high when `aluIn1` and `aluIn2` have the same value, that is, when `aluMinus == 0` (no need to test the 33rd bit)
- the second one, `LTU`, corresponds to unsigned comparison. It is given by the sign bit of our 33-bit subtraction.
- for the third one, there are two cases: if the signs differ, then `LT` goes high if `aluIn1` is negative, else it is given by the sign bit of our 33-bit subtraction.

Of course, we still need one adder for addition:
```verilog
   wire [31:0] aluPlus = aluIn1 + aluIn2;
```

Then, `aluOut` is computed as follows:
```verilog
   reg [31:0] aluOut;
   always @(*) begin
      case(funct3)
        3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
        3'b001: aluOut = aluIn1 << shamt;
        3'b010: aluOut = {31'b0, LT};
        3'b011: aluOut = {31'b0, LTU};
        3'b100: aluOut = (aluIn1 ^ aluIn2);
        3'b101: aluOut = funct7[5]? ($signed(aluIn1) >>> shamt) :
                                    ($signed(aluIn1) >> shamt);
        3'b110: aluOut = (aluIn1 | aluIn2);
        3'b111: aluOut = (aluIn1 & aluIn2);
      endcase
   end
```

Let us try on the IceStick. Yes ! 1167 LUTs, it fits ! But it is not a good reason to stop there, there are still several opportunities to save space. Let us take a look at `takeBranch`, can't we reuse the `EQ`,`LT`,`LTU` signals we just created ? Sure we can:
```verilog
   reg takeBranch;
   always @(*) begin
      case(funct3)
        3'b000: takeBranch = EQ;
        3'b001: takeBranch = !EQ;
        3'b100: takeBranch = LT;
        3'b101: takeBranch = !LT;
        3'b110: takeBranch = LTU;
        3'b111: takeBranch = !LTU;
        default: takeBranch = 1'b0;
      endcase
   end
```

For this to work, we also need to make sure that `rs2` is routed to the second ALU input also for branches:
```verilog
   wire [31:0] aluIn2 = (isALUreg | isBranch) ? rs2 : Iimm;
```

What does it give on the device ? 1094 LUTs, not that bad, but let us continue... The jump target for `JALR` is `rs1+Iimm`, and we created an adder especially for that, it is stupid because the ALU already computes that.
OK let us reuse it:
```verilog
   wire [31:0] nextPC = (isBranch && takeBranch) ? PC+Bimm :
                        isJAL  ? PC+Jimm :
                        isJALR ? {aluPlus[31:1],1'b0} :
                        PC+4;
```

How do we stand now ? 1030 LUTs. And it is not finished: what eats up the largest number of LUTs is the shifter, and we have three of them in the ALU (one for left shifts, one for logical right shifts and one for arithmetic right shifts). By another sorcerer's trick indicated by Matthias Koch (@mecrisp), it is possible to merge the two right shifts, by creating a 33-bit shifter with the additional bit set to 0 or 1 depending on the input's bit 31 and on whether it is a logical shift or an arithmetic shift.
```verilog
   wire [31:0] shifter =
               $signed({instr[30] & aluIn1[31], shifter_in}) >>> aluIn2[4:0];
```

Even better, Matthias told me it is in fact possible to use a single shifter, by flipping the input and flipping the output if it is a left shift (`flip32` reverses the bit order of a 32-bit word):
```verilog
   wire [31:0] shifter_in = (funct3 == 3'b001) ? flip32(aluIn1) : aluIn1;
   wire [31:0] leftshift = flip32(shifter);
```

The ALU then looks like that:
```verilog
   reg [31:0] aluOut;
   always @(*) begin
      case(funct3)
        3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
        3'b001: aluOut = leftshift;
        3'b010: aluOut = {31'b0, LT};
        3'b011: aluOut = {31'b0, LTU};
        3'b100: aluOut = (aluIn1 ^ aluIn2);
        3'b101: aluOut = shifter;
        3'b110: aluOut = (aluIn1 | aluIn2);
        3'b111: aluOut = (aluIn1 & aluIn2);
      endcase
   end
```

Where do we stand now ? 887 LUTs my friend !

_Note 1_ well, in fact one can gain even more space with the shifter, by shifting one single bit at each clock. The ALU then becomes a little bit more complicated (multi-cycle), but much much smaller (Femtorv32-quark uses this trick). We will see that later.

_Note 2_ with a multi-cycle ALU, we could also have a single 33-bit adder, and compute subtractions in three cycles, by separating the computation of `~aluIn2`, `aluIn1+(~aluIn2)` and `aluIn1+(~aluIn2)+1`.
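
When shrinking logic like this, it is easy to break something silently, so here is a minimal self-checking testbench sketch (simulation only) that compares the 33-bit-subtraction tests and the single-shifter construction against the plain Verilog operators on random inputs. It is written under a few assumptions: `flip32` is implemented here as a simple bit-reversal loop (its definition is not shown in this section), `arith` and `left` stand in for `instr[30]` and `(funct3 == 3'b001)`, and the module name and the `ref_*` wires are made up for the illustration.

```verilog
// Simulation-only sketch; module, `arith`, `left` and `ref_*` names are illustrative.
module shrunk_alu_tb;
   reg [31:0] aluIn1, aluIn2;
   reg        arith; // plays the role of instr[30] (1: arithmetic right shift)
   reg        left;  // plays the role of (funct3 == 3'b001) (1: left shift)

   // Assumed helper: reverses the bit order of a 32-bit word.
   function [31:0] flip32;
      input [31:0] x;
      integer i;
      begin
         for (i = 0; i < 32; i = i + 1) flip32[i] = x[31-i];
      end
   endfunction

   // The three tests, derived from one 33-bit subtraction.
   wire [32:0] aluMinus = {1'b1, ~aluIn2} + {1'b0, aluIn1} + 33'b1;
   wire EQ  = (aluMinus[31:0] == 0);
   wire LTU = aluMinus[32];
   wire LT  = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];

   // One right shifter used for the three kinds of shifts.
   wire [31:0] shifter_in = left ? flip32(aluIn1) : aluIn1;
   wire [31:0] shifter =
               $signed({arith & aluIn1[31], shifter_in}) >>> aluIn2[4:0];
   wire [31:0] leftshift = flip32(shifter);

   // Reference results, computed with the plain operators. Each one has its
   // own wire so that the signedness of an expression is not changed by
   // mixing it with unsigned operands.
   wire        ref_lt  = ($signed(aluIn1) < $signed(aluIn2));
   wire [31:0] ref_sll = aluIn1 << aluIn2[4:0];
   wire [31:0] ref_srl = aluIn1 >> aluIn2[4:0];
   wire [31:0] ref_sra = $signed(aluIn1) >>> aluIn2[4:0];
   wire [31:0] ref_shift = left ? ref_sll : arith ? ref_sra : ref_srl;
   wire [31:0] got_shift = left ? leftshift : shifter;

   integer n, errors;
   initial begin
      errors = 0;
      for (n = 0; n < 1000; n = n + 1) begin
         aluIn1 = $random;
         aluIn2 = (n % 4 == 0) ? aluIn1 : $random; // include equal operands
         left   = n[0];
         arith  = n[1] & ~n[0];
         #1; // let the combinatorial wires settle
         if (EQ  !== (aluIn1 == aluIn2)) errors = errors + 1;
         if (LTU !== (aluIn1 <  aluIn2)) errors = errors + 1;
         if (LT  !== ref_lt)             errors = errors + 1;
         if (got_shift !== ref_shift)    errors = errors + 1;
      end
      if (errors == 0) $display("all checks passed");
      else             $display("%0d mismatches", errors);
      $finish;
   end
endmodule
```

The same pattern (optimized wires on one side, obvious reference expressions on the other) can be reused each time a piece of the core is rewritten to save LUTs.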

Before then, another easy win is factoring the adder used for address computation, as follows:
```verilog
   wire [31:0] PCplusImm = PC + ( instr[3] ? Jimm[31:0] :
                                  instr[4] ? Uimm[31:0] :
                                             Bimm[31:0] );
   wire [31:0] PCplus4 = PC+4;
```

Then these two adders can be used by both `nextPC` and `writeBackData`:
```verilog
   assign writeBackData = (isJAL || isJALR) ? (PCplus4) :
                          (isLUI) ? Uimm :
                          (isAUIPC) ? PCplusImm :
                          aluOut;

   assign writeBackEn = (state == EXECUTE && !isBranch);

   wire [31:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm :
                        isJALR ? {aluPlus[31:1],1'b0} :
                        PCplus4;
```

The verdict ? 839 LUTs (we have gained another 50 LUTs or so...). There is still room for gaining more LUTs (by using a multi-cycle ALU for shifts, and by using a smaller number of bits for address computation), but we'll keep that for later, since we now have enough room on the device for the next steps.

## Step 13: subroutines (version 1, using plain RISC-V instructions)

OK, so now we have an (incomplete) RISC-V processor and a SOC, and both fit on the device. Remember, we are approaching the end, only 8 instructions to go (5 Load variants, 3 Store variants).

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [*] 6  | [*] | [*]   | [ ] 5 | [ ] 3 | [*] 1  |

Before attacking them, let us learn a bit more about RISC-V assembly and function calls. Up to now, we have used a gearbox to slow down the CPU in such a way that we can observe it executing our programs. Couldn't we implement a `wait` function instead and call it ? Let us see how to do that.

The first thing to do is to remove the `#(.SLOW(nnn))` parameter in the `Clockworks` instantiation:
```verilog
   Clockworks CW(
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );
```
This no longer generates a gearbox and directly wires the `CLK` signal of the board to the internal `clk` signal used by our design.

OK, so now we need to see two different things:
- how to write a function that waits for some time
- how to call it

_Wait a minute_ you are talking about function calls, but we do not have `Load` / `Store` instructions. We won't be able to push the return address on the stack (because we cannot read/write memory, and the stack is in memory !), so how is it possible ?

There would be many possible ways of using RISC-V instructions to implement function calls. To make sure everybody uses the same convention, there is an **application binary interface** that defines how to call functions, how to pass parameters, and which register does what. See [this document](https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md) for more details.

**Calling a function** In this document, we learn that for calling a function, the return address will be stored in `x1`. Hence one can call a function using `JAL(x1,offset)` where `offset` is the (signed) difference between the program counter and the address of the function to be called. This works provided the offset can be encoded in the 20-bit immediate of the Jimm format.

_Note_: for functions that are further away, one can use a combination of `AUIPC` and `JALR` to reach an arbitrary offset.

**Returning from a function** is done by jumping to the address stored in `x1`, which can be done by `JALR(x0,x1,0)`.

**Function arguments and return value**: The first 8 function arguments are passed through `x10`..`x17`, and the return value is passed through `x10` (it overwrites the first function argument).

That's interesting, even though we do not have `Load`/`Store`, we can write programs with functions, but we cannot write functions that call other functions, because this requires saving `x1` to the stack (well in fact nothing forbids us from doing that by saving `x1` in another register but then it would quickly become a mess, so we won't do that).

One little thing: we have just learnt that in the ABI, `x1` is used to store the return address of functions. Up to now we have wired it to the LEDs. Since we are now going to comply with the ABI, we need to choose another register instead. From now on, `x10` will be wired to the LEDs.

OK, so now we have everything we need to write yet another version of the blinky ! Let us choose a `slow_bit` constant, write a `wait` function that counts to `2^slow_bit`, and call it to slow down our blinky:
```verilog
`ifdef BENCH
   localparam slow_bit=15;
`else
   localparam slow_bit=19;
`endif

`include "riscv_assembly.v"
   integer L0_   = 4;
   integer wait_ = 20;
   integer L1_   = 28;

   initial begin
      ADD(x10,x0,x0);
   Label(L0_);
      ADDI(x10,x10,1);
      JAL(x1,LabelRef(wait_)); // call(wait_)
      JAL(zero,LabelRef(L0_)); // jump(l0_)

      EBREAK(); // I keep it systematically
                // here in case I change the program.

   Label(wait_);
      ADDI(x11,x0,1);
      SLLI(x11,x11,slow_bit);
   Label(L1_);
      ADDI(x11,x11,-1);
      BNE(x11,x0,LabelRef(L1_));
      JALR(x0,x1,0);

      endASM();
   end

   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[mem_addr[31:2]];
      end
   end
endmodule
```

Try [step13.v](step13.v) in simulation and on the device.

**Try this** "Knight Rider" blinky, with one routine for going from left to right, another routine for going from right to left, and the wait routine. _Hint_ you will need to save `x1` to another register.

## Step 14: subroutines (version 2, using RISC-V ABI and pseudo-instructions)

With the ABI, we have a standard way of writing programs, but there are many things to remember:
- all RISC-V registers are the same, but with the ABI, we need to use certain registers for certain tasks (`x1` for return address, `x10`..`x17` for function parameters, etc...);
- calling a function is implemented using `JAL` or `AUIPC` and `JALR`, and returning from a function is implemented using `JALR`.

On a CISC processor, there are often special instructions for calling functions (`CALL`) and for returning from a function (`RET`), and registers are often specialized (function return address, stack pointer, function parameters). This makes the programmer's life easier because there is less to remember. There is no reason not to do the same for a RISC processor ! Let us pretend that the registers are different and give them different names (or aliases). These names are listed [here](https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md#general-registers).

| ABI name          | name | usage                                       |
|-------------------|------|---------------------------------------------|
| `zero`            | `x0` | read:0 write:ignored                        |
| `ra`              | `x1` | return address                              |
| `t0`...`t6`       | ...  | temporary registers                         |
| `fp`,`s0`...`s11` | ...  | saved registers, `fp`=`s0`: frame pointer   |
| `a0`...`a7`       | ...  | function parameters and return value (`a0`) |
| `sp`              | `x2` | stack pointer                               |
| `gp`              | `x3` | global pointer                              |

Saved registers (`s0`, ... `s11`) are supposed to be left untouched or saved/restored by functions. You can put your local variables there. If you write a function, you are supposed to push the ones you use on the stack and pop them before returning.

For all the other registers, you cannot expect them to be preserved through function calls.

The global pointer `gp` can be used as a "shortcut" to reach memory areas that are far away in 1 instruction. We will see that later (once we have `Load` and `Store`).

In our VERILOG assembler [riscv_assembly.v](riscv_assembly.v), we just need to declare these aliases for register names:
```verilog
   localparam zero = x0;
   localparam ra   = x1;
   localparam sp   = x2;
   localparam gp   = x3;
   ...
   localparam t4   = x29;
   localparam t5   = x30;
   localparam t6   = x31;
```

Besides these names, there are also _pseudo-instructions_ for common tasks, such as:

| pseudo-instruction    | action                               |
|-----------------------|--------------------------------------|
| `LI(rd,imm)`          | loads a 32-bit number in a register  |
| `CALL(offset)`        | calls a function                     |
| `RET()`               | returns from a function              |
| `MV(rd,rs)`           | equivalent to `ADD(rd,rs,zero)`      |
| `NOP()`               | equivalent to `ADD(zero,zero,zero)`  |
| `J(offset)`           | equivalent to `JAL(zero,offset)`     |
| `BEQZ(rd1,offset)`    | equivalent to `BEQ(rd1,x0,offset)`   |
| `BNEZ(rd1,offset)`    | equivalent to `BNE(rd1,x0,offset)`   |
| `BGT(rd1,rd2,offset)` | equivalent to `BLT(rd2,rd1,offset)`  |

If the constant is in the [-2048,2047] range, `LI` is implemented using `ADDI(rd,x0,imm)`, else it uses a combination of `LUI` and `ADDI` (if you want to know how it works, see this [stackoverflow answer](https://stackoverflow.com/questions/50742420/risc-v-build-32-bit-constants-with-lui-and-addi), there are tricky details about sign extension).

Using ABI register names and pseudo-instructions, our program becomes as follows:
```verilog
   integer L0_   = 4;
   integer wait_ = 24;
   integer L1_   = 32;

   initial begin
      LI(a0,0);
   Label(L0_);
      ADDI(a0,a0,1);
      CALL(LabelRef(wait_));
      J(LabelRef(L0_));

      EBREAK();

   Label(wait_);
      LI(a1,1);
      SLLI(a1,a1,slow_bit);
   Label(L1_);
      ADDI(a1,a1,-1);
      BNEZ(a1,LabelRef(L1_));
      RET();

      endASM();
   end
```

It does not make a huge difference, but in longer programs, it improves legibility by showing the intent of the programmer (this one is a function, that one is a jump to a label etc...). Without it, since everything looks the same, reading a program is more difficult.

It is quite funny: the RISC-V standard has a super-simple instruction set, but programming with it is not that easy, so the ABI pretends that the instruction set is more complicated, like a CISC processor, and this makes the programmer's life easier.
It also ensures that a function written by a programmer can be called from a function written by another programmer, possibly in a different language.

We will see later how to use GNU assembler and C compiler to compile programs for our CPU. But before playing with software and toolchains, remember, we still have 8 instructions to implement in hardware (5 `Load` variants and 3 `Store` variants).

**Try this** invent (or copy it from [somewhere else](https://github.com/riscv-collab/riscv-gcc/blob/5964b5cd72721186ea2195a7be8d40cfe6554023/libgcc/config/riscv/muldi3.S)) a routine to multiply two numbers, test it on various inputs in simulation, and on the device.

## Step 15: Load

Let us see now how to implement load instructions. There are 5 different instructions:

| Instruction     | Effect                                                       |
|-----------------|--------------------------------------------------------------|
| LW(rd,rs1,imm)  | Load word at address (rs1+imm) into rd                       |
| LBU(rd,rs1,imm) | Load byte at address (rs1+imm) into rd                       |
| LHU(rd,rs1,imm) | Load half-word at address (rs1+imm) into rd                  |
| LB(rd,rs1,imm)  | Load byte at address (rs1+imm) into rd then sign extend      |
| LH(rd,rs1,imm)  | Load half-word at address (rs1+imm) into rd then sign extend |

_Note_ addresses are aligned on word boundaries for `LW` (multiple of 4 bytes) and halfword boundaries for `LH`,`LHU` (multiple of 2 bytes). It is a good thing, it makes things much easier for us...

But we still have some work to do ! First, some circuitry that determines the loaded value (that we will call `LOAD_data`).
As you can see, we got instructions for loading words, half-words and bytes, and the instructions that load half-words and bytes exist in two versions:
- `LBU`,`LHU` that load a byte,halfword in the LSBs of `rd`
- `LB`,`LH` that load a byte,halfword in the LSBs of `rd` then do sign extension:

For instance, imagine a signed byte with the value `-1`, that is `8'b11111111`, loading it in a 32-bit register with `LBU` will result in `32'b00000000000000000000000011111111`, whereas loading it with `LB` will result in `32'b11111111111111111111111111111111`, that is, the 32-bits version of `-1`.

So we got a "two-dimensional" array of cases (whether we load a byte, halfword, word, and whether we do sign extension or not). Well, in fact it is even more complicated. Remember, our memory is structured into words, so when we load a byte, we need to know which one it is (among 4), and when we load a halfword, we need to know which one it is (among 2). This can be done by examining the 2 LSBs of the address of the data to be loaded (`rs1 + Iimm`):

```verilog
   wire [31:0] loadstore_addr = rs1 + Iimm;
   wire [15:0] LOAD_halfword =
	       loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];

   wire  [7:0] LOAD_byte =
	       loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];
```

OK, so now we need to select among `mem_rdata` (`LW`), `LOAD_halfword` (`LH`,`LHU`) and `LOAD_byte` (`LB`,`LBU`). Examining the table in the [RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf) page 130, this is determined by the two LSBs of `funct3`:

```verilog
   wire mem_byteAccess     = funct3[1:0] == 2'b00;
   wire mem_halfwordAccess = funct3[1:0] == 2'b01;

   wire [31:0] LOAD_data =
         mem_byteAccess ? LOAD_byte     :
     mem_halfwordAccess ? LOAD_halfword :
                          mem_rdata     ;
```

Now we need to insert sign extension into this expression.
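Before doing so, the selection logic above can be cross-checked in software. The following is a minimal C++ sketch, not part of the tutorial sources: the function name `load_select` is hypothetical, its parameters simply mirror the Verilog signals `mem_rdata`, `loadstore_addr` and `funct3`, and it only models the zero-extended value (no sign extension yet).

```cpp
#include <cstdint>

// Hypothetical software model of the byte/halfword/word selection.
// Returns the loaded value zero-extended to 32 bits (no sign extension).
constexpr std::uint32_t load_select(std::uint32_t mem_rdata,
                                    std::uint32_t loadstore_addr,
                                    std::uint32_t funct3) {
    // Address bit 1 selects the upper or lower halfword of the word.
    const std::uint32_t halfword =
        (loadstore_addr & 2u) ? (mem_rdata >> 16) : (mem_rdata & 0xFFFFu);
    // Address bit 0 selects the upper or lower byte of that halfword.
    const std::uint32_t byte =
        (loadstore_addr & 1u) ? (halfword >> 8) : (halfword & 0xFFu);
    // The two LSBs of funct3 give the access width.
    const bool byte_access     = (funct3 & 3u) == 0u;
    const bool halfword_access = (funct3 & 3u) == 1u;
    return byte_access ? byte : halfword_access ? halfword : mem_rdata;
}

// Compile-time checks on an arbitrary test word.
static_assert(load_select(0xff0f0e0du, 412u, 0u) == 0x0du, "byte 0");
static_assert(load_select(0xff0f0e0du, 415u, 0u) == 0xffu, "byte 3");
static_assert(load_select(0xff0f0e0du, 414u, 1u) == 0xff0fu, "upper halfword");
static_assert(load_select(0xff0f0e0du, 412u, 2u) == 0xff0f0e0du, "whole word");
```

Since everything is `constexpr`, the checks run at compile time: if the file compiles, the model agrees with the expected values. Note that byte 3 comes back as `0xff`, not `-1`, which is exactly what the sign extension has to fix.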
The value to be written in the MSBs of `rd`, `LOAD_sign`, depends on both whether the instruction does sign extension (`LB`,`LH`), characterized by `funct3[2]=0`, and the MSB of the loaded value:

```verilog
   wire LOAD_sign =
	!funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);

   wire [31:0] LOAD_data =
         mem_byteAccess ? {{24{LOAD_sign}},     LOAD_byte} :
     mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
                          mem_rdata ;
```

Pfiuuuu, it was a bit painful, but in the end it is not too complicated. My initial design was much more complicated, but Matthias Koch (@mecrisp) simplified it a lot, resulting in the (reasonably easy to understand) design above.

We are not completely done though, now we need to modify the state machine. It will have two additional states, `LOAD` and `WAIT_DATA`:

```verilog
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;
   localparam LOAD        = 4;
   localparam WAIT_DATA   = 5;
   reg [2:0] state = FETCH_INSTR;
```

_Note 1_ we could do with a smaller number of states, but for now our goal is to have something that works and that is as easy to understand as possible. We will see later how to simplify the state machine.

_Note 2_ do not forget to check that `state` has the required number of bits ! (`reg [2:0] state` instead of `reg [1:0] state` as before !!).

Then the new states are plugged in as follows:

```verilog
   ...
	   EXECUTE: begin
	      if(!isSYSTEM) begin
		 PC <= nextPC;
	      end
	      state <= isLoad ? LOAD : FETCH_INSTR;
	   end
	   LOAD: begin
	      state <= WAIT_DATA;
	   end
	   WAIT_DATA: begin
	      state <= FETCH_INSTR;
	   end
   ...
```

And finally, the signals `mem_addr` (with the address to be read) and `mem_rstrb` (that goes high whenever the processor wants to read data) are driven as follows:

```verilog
   assign mem_addr = (state == WAIT_INSTR || state == FETCH_INSTR) ?
		     PC : loadstore_addr ;
   assign mem_rstrb = (state == FETCH_INSTR || state == LOAD);
```

Let us test now our new instructions with the following program:

```verilog
   integer L0_   = 8;
   integer wait_ = 32;
   integer L1_   = 40;

   initial begin
      LI(s0,0);
      LI(s1,16);
   Label(L0_);
      LB(a0,s0,400); // LEDs are plugged on a0 (=x10)
      CALL(LabelRef(wait_));
      ADDI(s0,s0,1);
      BNE(s0,s1, LabelRef(L0_));
      EBREAK();

   Label(wait_);
      LI(t0,1);
      SLLI(t0,t0,slow_bit);
   Label(L1_);
      ADDI(t0,t0,-1);
      BNEZ(t0,LabelRef(L1_));
      RET();

      endASM();

      // Note: index 100 (word address)
      //     corresponds to
      // address 400 (byte address)
      MEM[100] = {8'h4, 8'h3, 8'h2, 8'h1};
      MEM[101] = {8'h8, 8'h7, 8'h6, 8'h5};
      MEM[102] = {8'hc, 8'hb, 8'ha, 8'h9};
      MEM[103] = {8'hff, 8'hf, 8'he, 8'hd};
   end
```

This program initializes some values in four words at address 400, and loads them one byte at a time in `a0` (`x10`) in a loop. There is also a delay loop (`wait` function) to let you see something, just as before.

**Try this** Run the program in simulation and on the device. Test the other instructions. Do a programmable tinsel as in step 3.

**You are here !** Just three instructions to go and we will be done !

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [*] 6  | [*] | [*]   | [*] 5 | [ ] 3 | [*] 1  |

## Step 16: Store

We are approaching the end, but still some work to do, to implement the following three instructions:

| Instruction     | Effect                                  |
|-----------------|-----------------------------------------|
| SW(rs2,rs1,imm) | store rs2 at address rs1+imm            |
| SB(rs2,rs1,imm) | store 8 LSBs of rs2 at address rs1+imm  |
| SH(rs2,rs1,imm) | store 16 LSBs of rs2 at address rs1+imm |

To do so, we will need to do three different things:
- modify the interface between the processor and the memory in such a way that the processor can write to the memory
- the memory is addressed by words. Each write operation will modify a word.
  But `SB` and `SH` need to be able to write individual bytes. Besides the word to be written, we need to compute which bytes of this word should be effectively modified in memory (a 4-bits mask)
- the state machine needs to be modified.

The `Memory` module is modified as follows:

```verilog
module Memory (
   input             clk,
   input      [31:0] mem_addr,  // address to be read or written
   output reg [31:0] mem_rdata, // data read from memory
   input             mem_rstrb, // goes high when processor wants to read
   input      [31:0] mem_wdata, // data to be written
   input      [3:0]  mem_wmask  // write mask, one bit per byte (1 = write)
);

   reg [31:0] MEM [0:255];

   initial begin
      ...
   end

   wire [29:0] word_addr = mem_addr[31:2];

   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[word_addr];
      end
      if(mem_wmask[0]) MEM[word_addr][ 7:0 ] <= mem_wdata[ 7:0 ];
      if(mem_wmask[1]) MEM[word_addr][15:8 ] <= mem_wdata[15:8 ];
      if(mem_wmask[2]) MEM[word_addr][23:16] <= mem_wdata[23:16];
      if(mem_wmask[3]) MEM[word_addr][31:24] <= mem_wdata[31:24];
   end
endmodule
```

We have two new input signals: `mem_wdata`, a 32-bits signal with the value to be written, and `mem_wmask` a 4-bits signal that indicates which bytes should be written.

_Note_ you may wonder how it is implemented in practice, in particular how the masked write to memory is synthesized on the device. BRAMs on most FPGAs directly support masked writes, through vendor's special primitives. Yosys has a (super smart) special step called "technology mapping" that detects some patterns in the source VERILOG file, and instantiates the vendor's primitive best adapted to the usage. In fact technology mapping was used before in our tutorial, to represent the registers bank: at each cycle we read two registers, `rs1` and `rs2`. In the IceStick, BRAMs can read a single value at each clock, so to make it possible, yosys automatically duplicates the register bank. Whenever a value is written to `rd`, it is written to the two register banks: `bank1[rdId] <- writeBackValue; bank2[rdId] <- writeBackValue;`, and two different registers can be read at the same cycle, each one in its own register bank `rs1 <- bank1[rs1Id]; rs2 <- bank2[rs2Id];`.
With the magic of Yosys, you do not have to take care of this, it will automatically select the best mapping for you (duplicated register bank, single register bank with two read ports if target supports it, or even array of flipflops with address decoder for larger FPGAs with many LUTs). In our case, the IceStick has an Ice40HX1K, that has 8 kB of BRAM, organized in 8 blocks of 1 kB each. Two of them are used for the (duplicated) register bank, leaving 6 kB of BRAM that we use to synthesize system RAM.

The `Processor` module is updated accordingly:

```verilog
module Processor (
    input             clk,
    input             resetn,
    output     [31:0] mem_addr,
    input      [31:0] mem_rdata,
    output            mem_rstrb,
    output     [31:0] mem_wdata,
    output      [3:0] mem_wmask,
    output reg [31:0] x10 = 0
);
```

(and everything is connected in the `SOC`).

Let us see now how to compute the word to be written and the mask. The address where the value should be written is still `rs1 + imm`, but the format of the immediate value is different between `Load` (`Iimm`) and `Store` (`Simm`):

```verilog
   wire [31:0] loadstore_addr = rs1 + (isStore ? Simm : Iimm);
```

Now the data to be written depends on whether we write a byte, a halfword or a word, and for bytes and halfwords, also depends on the 2 LSBs of the address. Interestingly, we do not need to test whether we write a byte, a halfword or a word, because the write mask (see later) will ignore MSBs for byte and halfword write:

```verilog
   assign mem_wdata[ 7: 0] = rs2[7:0];
   assign mem_wdata[15: 8] = loadstore_addr[0] ? rs2[7:0]  : rs2[15: 8];
   assign mem_wdata[23:16] = loadstore_addr[1] ? rs2[7:0]  : rs2[23:16];
   assign mem_wdata[31:24] = loadstore_addr[0] ? rs2[7:0]  :
			     loadstore_addr[1] ? rs2[15:8] : rs2[31:24];
```

And finally, the 4-bits write mask, that indicates which bytes of `mem_wdata` should be effectively written to memory.
It is determined as follows:

| write mask                                   | Instruction                              |
|----------------------------------------------|------------------------------------------|
| `4'b1111`                                    | `SW`                                     |
| `4'b0011` or `4'b1100`                       | `SH`, depending on `loadstore_addr[1]`   |
| `4'b0001`, `4'b0010`, `4'b0100` or `4'b1000` | `SB`, depending on `loadstore_addr[1:0]` |

Deriving the expression is a bit painful. With Matthias Koch we ended up with this one:

```verilog
   wire [3:0] STORE_wmask =
	      mem_byteAccess      ?
	            (loadstore_addr[1] ?
		          (loadstore_addr[0] ? 4'b1000 : 4'b0100) :
		          (loadstore_addr[0] ? 4'b0010 : 4'b0001)
                    ) :
	      mem_halfwordAccess ?
	            (loadstore_addr[1] ? 4'b1100 : 4'b0011) :
              4'b1111;
```

Let us now create additional states in the state machine:

```verilog
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;
   localparam LOAD        = 4;
   localparam WAIT_DATA   = 5;
   localparam STORE       = 6;

   ...

   always @(posedge clk) begin
   ...
         case(state)
	   ...
	   EXECUTE: begin
	      if(!isSYSTEM) begin
		 PC <= nextPC;
	      end
	      state <= isLoad  ? LOAD  :
		       isStore ? STORE :
		       FETCH_INSTR;
	   end
	   LOAD: begin
	      state <= WAIT_DATA;
	   end
	   WAIT_DATA: begin
	      state <= FETCH_INSTR;
	   end
	   STORE: begin
	      state <= FETCH_INSTR;
	   end
	 endcase
      end
   end
```

The signals interfaced with the memory are driven as follows:

```verilog
   assign mem_addr = (state == WAIT_INSTR || state == FETCH_INSTR) ?
		     PC : loadstore_addr ;
   assign mem_rstrb = (state == FETCH_INSTR || state == LOAD);
   assign mem_wmask = {4{(state == STORE)}} & STORE_wmask;
```

And, at last, a little thing: do not write back to register bank if instruction is a `Store` !

```verilog
   assign writeBackEn = (state==EXECUTE && !isBranch && !isStore && !isLoad) ||
			(state==WAIT_DATA) ;
```

_Note_ The `!isLoad` term that prevents writing `rd` during `EXECUTE` can be removed from the condition, since `rd` will be overwritten right after during `WAIT_DATA`. It is there to have something easier to understand with simulations.

**Try this** Run [step16.v](step16.v) in simulation and on the device.
It copies 16 bytes from address 400 to address 800, then displays the values of the copied bytes.

**You are here !** Congratulations ! You have finished implementing your first RV32I RISC-V core !

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [*] 6  | [*] | [*]   | [*] 5 | [*] 3 | [*] 1  |

_But wait a minute_ for sure we have worked a lot to implement a RISC-V core, but all that I can see now is just something that looks like the stupid blinky at step 1 ! I want to see more !

To do so, we need to let our device communicate with the outside world with more than 5 LEDs.

## Step 17: Memory-mapped device - let's do (much) more than a blinky !

Now the idea is to add devices to our SOC. We already have LEDs, that are plugged to register `a0` (`x10`). Plugging devices on a register like that is not super elegant, it would be better to have a special address in memory that is not really actual RAM but that has a register plugged to the LEDs. With this idea, one can add as many devices as one likes, by assigning a virtual address to each device. Then the SOC will have address decoding hardware that routes the data to the right device. As you will see, besides removing from the processor the wires drawn from `x10` to the LEDs, this only requires some small modifications in the SOC.

Before starting to modify the SOC, the first thing to do is to decide about the "memory map", that is, which address space portion corresponds to what. In our system, we have 6 kB of RAM, so in practice we could say that addresses between 0 and 2^13-1 (8 kB, let us keep a power of two) correspond to RAM. I decided to use a larger portion of address space for RAM (because we also have FPGAs that have ampler quantities of BRAM), so the address space dedicated to RAM will be between 0 and 2^22-1 (that is, 4 MB of RAM).
Then, I decided to say that if bit 22 is set in an address, then this address corresponds to a device. Now we need to specify how to select among multiple devices. A natural idea is to use bits 0 to 21 as a "device index", but doing so is going to require multiple 22-bits wide comparators, and on our IceStick, it will eat-up a significant portion of the remaining LUTs. A better idea, suggested (once again) by Matthias Koch (@mecrisp), is to use 1-hot encoding, that is, data is routed to device number `n` if bit `n` is set in the address. We will only consider "word addresses" (that is, ignore the two LSBs). Doing that, we can only plug 20 different devices to our SOC, but it is still much more than what we need. The advantage is that it dramatically simplifies address decoding, in such a way that everything still fits in the IceStick.

To determine whether a memory request should be routed to the RAM or to the devices, we insert the following circuitry into the SOC:

```verilog
   wire [31:0] RAM_rdata;
   wire [29:0] mem_wordaddr = mem_addr[31:2];
   wire isIO  = mem_addr[22];
   wire isRAM = !isIO;
   wire mem_wstrb = |mem_wmask;
```

The RAM is wired as follows:

```verilog
   Memory RAM(
      .clk(clk),
      .mem_addr(mem_addr),
      .mem_rdata(RAM_rdata),
      .mem_rstrb(isRAM & mem_rstrb),
      .mem_wdata(mem_wdata),
      .mem_wmask({4{isRAM}}&mem_wmask)
   );
```

(note the `isRAM` signal ANDed with the write mask)

Now we can add the logic to wire our LEDs.
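The decoding rules above are small enough to be modelled in a few lines of software, which can help when choosing addresses by hand. This is only a hedged C++ sketch: the helper names `is_io`, `is_ram` and `device_selected` are made up for illustration, and it assumes the memory map described above (bit 22 selects the IO page, bit `n` of the word address selects device `n`).

```cpp
#include <cstdint>

// Bit 22 of the byte address selects the IO page.
constexpr bool is_io(std::uint32_t mem_addr) {
    return ((mem_addr >> 22) & 1u) != 0;
}

// Everything that is not IO is routed to the RAM.
constexpr bool is_ram(std::uint32_t mem_addr) {
    return !is_io(mem_addr);
}

// 1-hot decoding: device n is selected when bit n of the *word* address
// (byte address with its two LSBs dropped) is set, inside the IO page.
constexpr bool device_selected(std::uint32_t mem_addr, unsigned n) {
    return is_io(mem_addr) && (((mem_addr >> 2) >> n) & 1u) != 0;
}

// Compile-time checks.
static_assert(is_ram(400u), "low addresses are RAM");
static_assert(device_selected(0x400004u, 0), "IO page + 4 selects device 0");
static_assert(!device_selected(0x400004u, 1), "and only device 0");
static_assert(device_selected(0x400008u, 1), "IO page + 8 selects device 1");
```

With this encoding, the byte offset of device `n` in the IO page is `1 << (n + 2)`, so device 0 sits at offset 4 and device 1 at offset 8. No comparator is needed in hardware, a single address bit per device is enough.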
The LEDs are declared as a `reg` in the SOC module interface:

```verilog
module SOC (
    input            CLK,
    input            RESET,
    output reg [4:0] LEDS,
    input            RXD,
    output           TXD
);
```

driven by a simple block:

```verilog
   localparam IO_LEDS_bit = 0;

   always @(posedge clk) begin
      if(isIO & mem_wstrb & mem_wordaddr[IO_LEDS_bit]) begin
	 LEDS <= mem_wdata;
      end
   end
```

Now we can write (yet another version of) our good old blinky:

```verilog
   LI(gp,32'h400000);
   LI(a0,0);
Label(L1_);
   SW(a0,gp,4);
   CALL(LabelRef(wait_));
   ADDI(a0,a0,1);
   J(LabelRef(L1_));
```

First we load the base address of the IO page in `gp` (that is, `2^22`). To write the LEDs value, we store `a0` to word address 1 (that is, byte address 4) in the IO page.

To make things easier when we'll have several devices (right after), let us write some helper functions:

```verilog
   // Memory-mapped IO in IO page, 1-hot addressing in word address.
   localparam IO_LEDS_bit      = 0;  // W five leds

   // Converts an IO_xxx_bit constant into an offset in IO page.
   function [31:0] IO_BIT_TO_OFFSET;
      input [31:0] bit;
      begin
	 IO_BIT_TO_OFFSET = 1 << (bit + 2);
      end
   endfunction
```

Then we can write to the LEDs as follows:

```verilog
   SW(a0,gp,IO_BIT_TO_OFFSET(IO_LEDS_bit));
```

_OK, is that all you have, still your stupid blinky after 17 (!) tutorial steps ?_

Sure, you are right man. Let us add an UART to allow our core to display stuff to a virtual terminal. The IceStick (and many other FPGA boards) has a special chip (an FTDI FT2232H if you want to know), that translates between the plain old RS232 serial protocol and USB. It is good news for us, because RS232 is a simple protocol, much easier to implement than USB. In fact, our core will communicate with the outside world through two pins (one for sending data, called `TXD` and one for receiving data, called `RXD`), and the FTDI chip converts to the USB protocol for you.
Moreover, it is a good idea not to reinvent the wheel, and there are many existing implementations of UART (Universal Asynchronous Receiver Transmitter, that implement the RS232 protocol) in VERILOG. For our purpose, for now we will only implement half of it (that is, the part that lets our processor send data over it to display text in a terminal emulator).

Olof Kindgren has written a [Tweet-size UART](https://twitter.com/OlofKindgren/status/1409634477135982598), more legible version [here](https://gist.github.com/olofk/e91fba2572396f55525f8814f05fb33d).

Let us insert it into our SOC and connect it:

```verilog
   // Memory-mapped IO in IO page, 1-hot addressing in word address.
   localparam IO_LEDS_bit      = 0;  // W five leds
   localparam IO_UART_DAT_bit  = 1;  // W data to send (8 bits)
   localparam IO_UART_CNTL_bit = 2;  // R status. bit 9: busy sending

   ...

   wire uart_valid = isIO & mem_wstrb & mem_wordaddr[IO_UART_DAT_bit];
   wire uart_ready;

   corescore_emitter_uart #(
      .clk_freq_hz(`BOARD_FREQ*1000000),
      .baud_rate(115200)
   ) UART(
      .i_clk(clk),
      .i_rst(!resetn),
      .i_data(mem_wdata[7:0]),
      .i_valid(uart_valid),
      .o_ready(uart_ready),
      .o_uart_tx(TXD)
   );

   wire [31:0] IO_rdata =
	       mem_wordaddr[IO_UART_CNTL_bit] ? { 22'b0, !uart_ready, 9'b0}
	                                      : 32'b0;

   assign mem_rdata = isRAM ? RAM_rdata :
	                      IO_rdata ;
```

The UART is projected onto two different addresses in memory space. The first one, that can be only written to, sends one character. The second one, that can be only read from, indicates whether the UART is ready (bit 9 = 0) or busy sending a character (bit 9 = 1).

Now our processor has more possibilities to communicate with the outside world than the poor five LEDs we had before ! Let us implement a function to send a character:

```verilog
   Label(putc_);
      // Send character to UART
      SW(a0,gp,IO_BIT_TO_OFFSET(IO_UART_DAT_bit));
      // Read UART status, and loop until bit 9 (busy sending)
      // is zero.
      LI(t0,1<<9);
   Label(putc_L0_);
      LW(t1,gp,IO_BIT_TO_OFFSET(IO_UART_CNTL_bit));
      AND(t1,t1,t0);
      BNEZ(t1,LabelRef(putc_L0_));
      RET();
```

It writes the character to the UART address projected in IO space, then loops while the UART status indicates that it is busy sending a character.

**Try this** run [step17.v](step17.v) in simulation.

_Wait a minute_ in simulation, how does it know how to display something ? It's because I cheated a bit, I added the following block of code to the SOC:

```verilog
`ifdef BENCH
   always @(posedge clk) begin
      if(uart_valid) begin
	 $write("%c", mem_wdata[7:0] );
	 $fflush(32'h8000_0001);
      end
   end
`endif
```

(the magic constant argument to `$fflush()` corresponds to `stdout`, you need to do that else you do not see anything on the terminal until the output buffer of `stdout` is full). Doing so we do not test the UART in simulation (it is completely bypassed). I trust Olof that it works fine, but to do things properly, it would be better to plug something on the simulated `TXD` signal, decode the RS232 protocol and display the characters (we'll see examples of this type of simulation later on).

**Try this** run [step17.v](step17.v) on device.

To display what's sent to the UART, use:

```
$ ./terminal.sh
```

_Note_ edit `terminal.sh` and choose your favourite terminal emulator in there. You may also need to change `DEVICE=/dev/ttyUSB1` according to your local configuration.

## Step 18: Computing the Mandelbrot set

Now that we have a functional RISC-V processor and a SOC with an UART that can send characters to a virtual terminal, let us rest a little bit with a purely software step. In this step, we are going to write a program in RISC-V assembly that computes a crude, ASCII-art version of the Mandelbrot set.

Our "image" will be made of 80x80 characters. So let us start by writing a program that fills the image with "*" characters. To do that, we will use two nested loops. The X coordinate will be stored in `s0` and the Y coordinate in `s1`.
The upper bound (80) will be stored in `s11`. The program looks like that:

```verilog
   LI(gp,32'h400000); // IO page
   LI(s1,0);
   LI(s11,80);

Label(loop_y_);
   LI(s0,0);

Label(loop_x_);
   LI(a0,"*");
   CALL(LabelRef(putc_));

   ADDI(s0,s0,1);
   BNE(s0,s11,LabelRef(loop_x_));

   LI(a0,13);
   CALL(LabelRef(putc_));
   LI(a0,10);
   CALL(LabelRef(putc_));

   ADDI(s1,s1,1);
   BNE(s1,s11,LabelRef(loop_y_));

   EBREAK();
```

(and we copy the `putc` function from the previous example).

**Fixed point** So now we want to compute the Mandelbrot set. To do that, we need to manipulate real numbers. Unfortunately, our super simplistic RISC-V core is not able to directly manipulate floating point numbers. The C compiler's support library `libgcc` has some functions to support them, but we will see later how to use them. For now, the idea is to compute the Mandelbrot set using fixed-point numbers, that is, in an integer number, we will use some bits to represent the fractional part (10 bits in our case), and some bits to represent the integer part (22 bits in our case). In other words, it means that if we want to represent a real number `x`, we will store (the integer part of) `x*2^10` in a register. It is similar to floating point numbers, except that the exponent in our case is always 10. We will use the following constants in our program:

```verilog
`define mandel_shift 10
`define mandel_mul (1 << `mandel_shift)
```

Now, to compute the sum or the difference of two numbers, it does not change anything, because the `2^10` factor is the same for both numbers to be added (or subtracted). For a product it is a different story, because when you compute `x*y`, the actual computation that you do is `x*2^10*y*2^10`, so what you get is `(x*y)*2^20`, and you wanted `(x*y)*2^10`, so you need to divide by `2^10` (right shift by `10`).

OK, that's good, but how do we compute the product of two integer numbers stored in two registers ? Our processor has no `MUL` instruction ?
In fact it is possible to add a `MUL` instruction (it is part of the RV32M instruction set), we will see that later, but it will not fit within our tiny IceStick ! So what can we do ? We can implement a function that takes two numbers in `a0` and `a1`, computes their product and returns it in `a0`. The C compiler support library `libgcc` has one (it is what is used when compiling C for small RV32I RISC-V processors that do not have the `MUL` instruction, like ours). The source-code of this function is [here](https://github.com/riscv-collab/riscv-gcc/blob/5964b5cd72721186ea2195a7be8d40cfe6554023/libgcc/config/riscv/muldi3.S). Let us port it to our VERILOG RISC-V assembler (that has a slightly different syntax unfortunately, we will see later how to directly use gcc and gas):

```verilog
   // Multiplication routine,
   // Input in a0 and a1
   // Result in a0
   Label(mulsi3_);
      MV(a2,a0);
      LI(a0,0);
   Label(mulsi3_L0_);
      ANDI(a3,a1,1);
      BEQZ(a3,LabelRef(mulsi3_L1_));
      ADD(a0,a0,a2);
   Label(mulsi3_L1_);
      SRLI(a1,a1,1);
      SLLI(a2,a2,1);
      BNEZ(a1,LabelRef(mulsi3_L0_));
      RET();
```

(do not forget to declare the new labels before the `initial` block).

So now, before displaying the Mandelbrot set, to test our fixed-point computation idea, let us display a simpler shape, that is, we consider we are visualizing the `[-2.0,2.0]x[-2.0,2.0]` square (mapped to our 30x30 characters display), and we want to display a disk of radius `2` centered on `(0,0)`. To do that, we need first to compute the (fixed point) coordinates `x,y`. They will be stored in `s2` and `s3`. Then we need to compute `x^2+y^2`. We can do that by invoking the `mulsi3` routine twice (do not forget to rightshift the result by 10). Finally, we compare the result with `4 << 10` (4 because it is the _squared_ radius, and shifted to the left by 10 because of our fixed-point representation), to decide whether the point was inside or outside the disk, and use a different character to display it.
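To check the fixed-point arithmetic on a desktop machine before writing the assembly, the same computation can be sketched in C++. This is an illustrative model, not code from the tutorial: the names `mul_shift_add`, `fixed_square` and `in_disk` are hypothetical, and the multiply is a plain shift-and-add loop working modulo 2^32, in the spirit of the `mulsi3` routine.

```cpp
#include <cstdint>

constexpr unsigned mandel_shift = 10;                   // fractional bits
constexpr std::uint32_t norm_max = 4u << mandel_shift;  // 4.0 in fixed point

// Shift-and-add product (modulo 2^32): add `a` for each set bit of `b`.
constexpr std::uint32_t mul_shift_add(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    while (b != 0) {
        if (b & 1u) result += a;
        b >>= 1;
        a <<= 1;
    }
    return result;
}

// Fixed-point square: (x*2^10)*(x*2^10) = x^2*2^20, so shift right by 10.
// A square is never negative, hence the logical shift is safe here.
constexpr std::uint32_t fixed_square(std::int32_t x) {
    return mul_shift_add(static_cast<std::uint32_t>(x),
                         static_cast<std::uint32_t>(x)) >> mandel_shift;
}

// True when the fixed-point point (x, y) is strictly inside the disk
// of radius 2, i.e. x^2 + y^2 < 4.
constexpr bool in_disk(std::int32_t x, std::int32_t y) {
    return fixed_square(x) + fixed_square(y) < norm_max;
}

// Compile-time checks: 1.5 is stored as 1536, 2.25 as 2304.
static_assert(fixed_square(1536) == 2304u, "1.5 * 1.5 = 2.25");
static_assert(in_disk(0, 0), "the origin is inside");
static_assert(!in_disk(-2048, 0), "(-2.0, 0) lies on the circle, so outside");
```

The last check shows a detail that also holds on the device: a point exactly on the circle gives `x^2 + y^2 = 4`, which is not strictly smaller than `norm_max`, so it is drawn as outside.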
The corresponding program looks like that:

```verilog
`define mandel_shift 10
`define mandel_mul (1 << `mandel_shift)
`define xmin (-2*`mandel_mul)
`define xmax ( 2*`mandel_mul)
`define ymin (-2*`mandel_mul)
`define ymax ( 2*`mandel_mul)
`define dx ((`xmax-`xmin)/30)
`define dy ((`ymax-`ymin)/30)
`define norm_max (4 << `mandel_shift)

   integer    loop_y_      = 28;
   integer    loop_x_      = 36;
   integer    in_disk_     = 92;

   initial begin
      LI(gp,32'h400000); // IO page

      LI(s1,0);
      LI(s3,`ymin);      // s3 = y
      LI(s11,30);
      LI(s10,`norm_max);

   Label(loop_y_);
      LI(s0,0);
      LI(s2,`xmin);      // s2 = x

   Label(loop_x_);
      MV(a0,s2);
      MV(a1,s2);
      CALL(LabelRef(mulsi3_));
      SRLI(s4,a0,`mandel_shift); // s4 = x*x
      MV(a0,s3);
      MV(a1,s3);
      CALL(LabelRef(mulsi3_));
      SRLI(s5,a0,`mandel_shift); // s5 = y*y
      ADD(s6,s4,s5);             // s6 = x*x+y*y
      LI(a0,"*");
      BLT(s6,s10,LabelRef(in_disk_)); // if x*x+y*y < 4
      LI(a0," ");
   Label(in_disk_);
      CALL(LabelRef(putc_));

      ADDI(s0,s0,1);
      ADDI(s2,s2,`dx);
      BNE(s0,s11,LabelRef(loop_x_));

      LI(a0,13);
      CALL(LabelRef(putc_));
      LI(a0,10);
      CALL(LabelRef(putc_));

      ADDI(s1,s1,1);
      ADDI(s3,s3,`dy);
      BNE(s1,s11,LabelRef(loop_y_));

      EBREAK();
```

and the output looks like that:

```
          ***********
        ***************
       ******************
     *********************
    ***********************
    ************************
   *************************
  ***************************
  ***************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
  ***************************
  ***************************
   *************************
   *************************
    ***********************
     *********************
      *******************
        ***************
          ***********
```

Now to compute the Mandelbrot set, we need to iterate the following operation:

```
   Z <- 0; iter <- 0
   do
      Z <- Z^2 + C
      iter <- iter + 1
   while |Z| < 2
```

where `Z` and `C` are complex numbers. `C = x + iy` corresponds to the current pixel. Remember the rule for complex number multiplication (`i*i = -1`), we can compute `Z^2 = (Zr + i*Zi)^2 = Zr^2-Zi^2 + 2*i*Zr*Zi`. The loop that computes these iterates is written as follows:

```verilog
   Label(loop_Z_);
      MV(a0,s4); // Zrr  <- (Zr*Zr) >> mandel_shift
      MV(a1,s4);
      CALL(LabelRef(mulsi3_));
      SRLI(s6,a0,`mandel_shift);
      MV(a0,s4); // Zri <- (Zr*Zi) >> (mandel_shift-1)
      MV(a1,s5);
      CALL(LabelRef(mulsi3_));
      SRAI(s7,a0,`mandel_shift-1);
      MV(a0,s5); // Zii <- (Zi*Zi) >> (mandel_shift)
      MV(a1,s5);
      CALL(LabelRef(mulsi3_));
      SRLI(s8,a0,`mandel_shift);
      SUB(s4,s6,s8); // Zr <- Zrr - Zii + Cr
      ADD(s4,s4,s2);
      ADD(s5,s7,s3); // Zi <- 2Zri + Ci
      ADD(s6,s6,s8); // if norm > norm max, exit loop
      LI(s7,`norm_max);
      BGT(s6,s7,LabelRef(exit_Z_));
      ADDI(s10,s10,-1);  // iter--, loop if non-zero
      BNEZ(s10,LabelRef(loop_Z_));
   Label(exit_Z_);
```

In the end, we display different characters depending on the value of `iter` (`s10`) when the loop is exited:

```verilog
   Label(exit_Z_);
      LI(a0,colormap_);
      ADD(a0,a0,s10);
      LBU(a0,a0,0);
      CALL(LabelRef(putc_));
```

where the "colormap" is an array of characters that mimic different "intensities", from the darkest to the brightest:

```verilog
   Label(colormap_);
      DATAB(" ",".",",",":");
      DATAB(";","o","x","%");
      DATAB("#","@", 0 , 0 );
```

![](https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/TUTORIALS/Images/mandelbrot_terminal.gif)

**Try this** run [step18.v](step18.v) in simulation and on the device. Modify it to draw your own graphics (for instance, try drawing "concentric circles" using the "colormap").

## Step 19: Faster simulation with Verilator

As you have seen in Step 18, simulation is much much slower than running the design on the device. However, there is another tool, called `verilator`, that lets you convert a VERILOG design into C++. Then you compile the C++, and you have a simulation that is much much faster than icarus/iverilog.
Let us first install verilator:

```
$ apt-get install verilator
```

Before transforming our design into C++, we will have to create a "bench", that is, some C++ code that will generate the signals for our design, and that will declare the C++ `main()` function. The main role of the main function is to declare an object of class `VSOC` (generated from our `SOC` module), and wiggle its `CLK` signal. Each time the `CLK` signal is changed, you need to call the `eval()` function to take the change into account. The `sim_main.cpp` file is as follows:

```c++
#include "VSOC.h"
#include "verilated.h"
#include <iostream>

int main(int argc, char** argv, char** env) {
   VSOC top;
   top.CLK = 0;
   while(!Verilated::gotFinish()) {
      top.CLK = !top.CLK;
      top.eval();
   }
   return 0;
}
```

In addition, in [sim_main.cpp](sim_main.cpp), there is some code to detect whenever the LEDs change, and display their status.

To convert a design to C++, use the following command:

```
$ verilator -DBENCH -DBOARD_FREQ=12 -Wno-fatal --top-module SOC -cc -exe sim_main.cpp step18.v
```

Then to compile the C++ and run the generated program:

```
$ cd obj_dir
$ make -f VSOC.mk
$ ./VSOC
```

As you can see, it is much much faster than icarus/iverilog ! For a small design, it does not make a huge difference, but believe me, when you are developing an RV32IMFC core, with an FPU, it is good to have efficient simulation !

To make things easier, there is a `run_verilator.sh` script, that you can invoke as follows:

```
$ run_verilator.sh step18.v
```

## Step 20: Using the GNU toolchain to compile programs - assembly

At this step, you may have the feeling that our RISC-V design is just a toy, for educational purpose, far away from "the real thing". In fact, at this step, you will start feeling that what you have done is as real as any other RISC-V processor !
What makes a processor interesting is the software you can run on it, hence if our thingy can run any software written for a (RV32I) RISC-V processor, then it is a RV32I RISC-V processor.

_Wait a minute_ but what we have used up to now to write the software is the VERILOG assembler, it is just a toy, different from the real thing no ?

In fact, the VERILOG assembler generates exactly the same machine code as any other RISC-V assembler. We could use instead any other RISC-V assembler, load the generated machine code into our design and run it !

To do so, VERILOG has a `$readmemh()` command, that loads the data to initialize a memory from an external file. It is used as follows in [step20.v](step20.v):

```verilog
   initial begin
       $readmemh("firmware.hex",MEM);
   end
```

where `firmware.hex` is an ASCII file with the initial content of `MEM` in hexadecimal.

So if we want to use an external assembler, all we have to do is figure out the following things:
- how to compile RISC-V assembly code using GNU tools
- how to tell GNU tools about the device we have created (RAM start address, RAM amount)
- how to convert the output of GNU tools into a file that `$readmemh()` can understand

OK, let us start with a simple blinker, in [blinker.S](FIRMWARE/blinker.S):

```
# Simple blinker

.equ IO_BASE, 0x400000
.equ IO_LEDS, 4

.section .text

.globl start

start:
        li   gp,IO_BASE
        li   sp,0x1800
.L0:
        li   t0, 5
        sw   t0, IO_LEDS(gp)
        call wait
        li   t0, 10
        sw   t0, IO_LEDS(gp)
        call wait
        j .L0

wait:
        li t0,1
        slli t0, t0, 17
.L1:
        addi t0,t0,-1
        bnez t0, .L1
        ret
```

As you can see, it is very similar to the code we wrote up to now in the VERILOG assembler.
In this program, we have three different things:
- **main program**
- **utilities**, here the `wait` function
- **setup**, that is, initializing `gp` and `sp`

So we will split the file into three parts:
- [FIRMWARE/blinker.S](FIRMWARE/blinker.S) with the `main` function
- [FIRMWARE/wait.S](FIRMWARE/wait.S) with the `wait` function
- [FIRMWARE/start.S](FIRMWARE/start.S) with the setup code, that calls `main` in the end.

To compile it, you will need to install the RISC-V toolchain (compiler, assembler, linker) on your machine. Our makefile can do that for you:
```
$ cd learn-fpga/FemtoRV
$ make ICESTICK.firmware_config
```
_Note:_ always use `ICESTICK.firmware_config`, even if you have a larger board, it will configure the makefiles for an `RV32I` build (and that's what our processor supports).

This will download some files and unpack them in `learn-fpga/FemtoRV/FIRMWARE/TOOLCHAIN`. Add the `riscv64-unknown-elf-gcc..../bin/` directory to your path.

Now to compile our program:
```
$ cd learn-fpga/FemtoRV/TUTORIALS/FROM_BLINKER_TO_RISCV/FIRMWARE
$ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax start.S -o start.o
$ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax blinker.S -o blinker.o
$ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax wait.S -o wait.o
```

We specify the architecture (`rv32i`) that corresponds to the instructions supported by our processor and the ABI (`ilp32`) that corresponds to the way functions are called. The `-mno-relax` option concerns the `gp` register that we use for accessing the IO page (so we do not let the assembler use it for anything else).

This generates object files (`.o`). We now need to generate an executable from them, by invoking the linker. The linker will determine where our code and data should be placed in memory.
For that, we need to specify how the memory in our device is organized, in a linker script ([FIRMWARE/bram.ld](FIRMWARE/bram.ld)):
```
MEMORY
{
   BRAM (RWX) : ORIGIN = 0x0000, LENGTH = 0x1800  /* 6kB RAM */
}
SECTIONS
{
    everything :
    {
        . = ALIGN(4);
        start.o (.text)
        *(.*)
    } >BRAM
}
```

A linker script contains a description of `MEMORY`. In our case, there is a single segment of 6 kB of memory, that we call `BRAM`. It starts from address `0x0000`. Then we have `SECTIONS`, that indicates what goes where (or which section goes to which memory). In our case, it is super simple: everything goes to BRAM. We also indicate that the content of `start.o` should be installed first in memory.

The linker is invoked as follows:
```
$ riscv64-unknown-elf-ld blinker.o wait.o -o blinker.bram.elf -T bram.ld -m elf32lriscv -nostdlib -norelax
```
It generates an "elf" executable ("elf" stands for Executable and Linkable Format). It is the same format as the binaries in a Linux system. The option `-T bram.ld` tells it to use our linker script. The option `-m elf32lriscv` indicates that we are generating a 32-bit executable. We are not using the C stdlib for now (`-nostdlib`) and we keep `gp` for ourselves (`-norelax`). We do not need to have `start.o` on the command line in the list of objects to link, because it is already included in the linker script `bram.ld`.

We are not completely done, now we need to extract the relevant information from the elf executable, and generate a file with all the machine code in hexadecimal, so that VERILOG's `$readmemh()` function can understand it. For that, I wrote a `firmware_words` utility, that understands the elf file format, extracts the parts that are interesting for us and writes them in ASCII hexadecimal:
```
$ make blinker.bram.hex
```
_Note_ you can invoke `make xxxx.bram.hex` directly, it will invoke the assembler, linker and elf conversion utility for you automatically.

Now you can run the example in simulation and on the device:
```
$ cd ..
$ ./run_verilator.sh step20.v
$ BOARDS/run_xxx.sh step20.v
```

Now that things are easier, we can write more complicated programs. Let us see how to write the famous "hello world" program. What we need is a `putstring` routine to display a string on the tty. It takes as input the address of the first character of the string to display in `a0`. We just need to loop on all characters of the string, call `putchar` for each character, and exit the loop as soon as we find a null character:
```
# Warning, buggy code ahead !
putstring:
        mv t2,a0
.L2:    lbu a0,0(t2)
        beqz a0,.L3
        call putchar
        addi t2,t2,1
        j .L2
.L3:    ret
```

Have you seen the comment ? It means the code above has an error, can you spot it ?

A hint: `putstring` is a function that calls a function. Don't we need to do something special in this case ?

Do you remember what `call` and `ret` do ? Yes, `call` stores `PC+4` in `ra` then jumps to the function, and `ret` jumps to the address in `ra`. Now suppose that somebody called our `putstring` function. When we enter the function, `ra` contains the address we are supposed to jump to when reaching the `ret` statement in `putstring`. But inside `putstring`, we call `putchar`, and it overwrites `ra` with the address right after the call, so that `putchar` will be able to jump there when it returns, but `putstring` will jump there as well, which is not what we want. To avoid that, we need to save `ra` at the beginning of `putstring`, and restore it at the end. To do that, we use the stack as follows:
```
putstring:
        addi sp,sp,-4 # save ra on the stack
        sw ra,0(sp)   # (need to do that for functions that call functions)
        mv t2,a0
.L2:    lbu a0,0(t2)
        beqz a0,.L3
        call putchar
        addi t2,t2,1
        j .L2
.L3:    lw ra,0(sp)   # restore ra
        addi sp,sp,4  # restore sp
        ret
```

The function can be used as follows:
```
   la   a0, hello
   call putstring

   ...

hello:
   .asciz "Hello, world !\n"
```
The `la` (load address) pseudo-instruction loads the address of the string in `a0`.
The string is declared with a standard label, and the `.asciz` directive that generates a zero-terminated string.

**Try this** Compile `hello.S` (`cd FIRMWARE; make hello.bram.hex`) and test it in simulation and on device. Try also `mandelbrot.S`.

As you can see, [FIRMWARE/mandelbrot.S](FIRMWARE/mandelbrot.S) does not define the `__mulsi3` function. If you take a look at [FIRMWARE/Makefile](FIRMWARE/Makefile), the executable is linked with the right version of `libgcc.a` (for RV32I), that has it.

Now you can start having a feeling that your processor is a real thing: when you run the Mandelbrot example, it executes code on *your* processor that was written by *somebody else*. Can we go further and run code generated by standard tools ?

## Step 21: Using the GNU toolchain to compile programs - C

Let us see now how we can write code in C for our processor.

At this point, we are able to generate object files (`.o`) and produce an elf executable from them using the linker. Our linker script ensures that everything goes to the right place in memory, then our processor can execute the code, first the content of `start.S`, placed at address 0, that calls in turn the `main` function.

Up to now our programs were completely written in assembly. The nice thing with the ABI (Application Binary Interface), that we have seen at steps 13 and 14, is that it makes it possible to combine object files (`.o`) produced by different tools, as long as they respect the ABI, which is the case (of course) of the C compiler.

The example [FIRMWARE/sieve.c](FIRMWARE/sieve.c), taken from the examples in picorv32, is a good candidate. It is interesting, it does multiplications, divisions and modulos using integer numbers. These operations are not implemented by our RV32I core, but they are supported by the compiler using functions in `libgcc.a`, and since we link with `libgcc.a`, this will work.
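
To give an idea of what such helper functions do, here is a minimal sketch in C of a multiplication and an unsigned division that only use shifts, additions, subtractions and comparisons, all available in RV32I. The names `soft_mul` and `soft_divu` are made up for this example, and this is not the actual source code of `libgcc.a`, just an illustration of the principle:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not libgcc's code): shift-and-add multiplication.
 * The result wraps modulo 2^32. */
uint32_t soft_mul(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u) {
            result += a;   /* add the current shifted copy of a */
        }
        a <<= 1;           /* the next bit of b weighs twice as much */
        b >>= 1;
    }
    return result;
}

/* Hypothetical helper (not libgcc's code): shift-and-subtract division.
 * Returns the quotient, and stores the remainder in *rem if rem is not NULL. */
uint32_t soft_divu(uint32_t n, uint32_t d, uint32_t *rem) {
    assert(d != 0);        /* division by zero is left out of this sketch */
    uint32_t q = 0;
    uint32_t r = 0;
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next bit of n */
        if (r >= d) {
            r -= d;
            q |= (1u << i);
        }
    }
    if (rem != NULL) {
        *rem = r;
    }
    return q;
}
```

For instance, `soft_mul(6, 7)` returns 42, and `soft_divu(100, 7, &r)` returns 14 with `r` set to 2.
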
However, the program also uses `printf()` to display the result, and this function is declared in `libc.a`. In principle, it would be possible to use it, but `printf()` supports so many formats that its code is too large and will not fit in our 6 kB of RAM. For this reason, we include a much smaller / much simpler version in [FIRMWARE/print.c](FIRMWARE/print.c) (also taken from picorv32), and included in the objects to be linked with executables.

![](mandel_and_riscvlogo.png)

There are two other examples, a C version of the Mandelbrot program: [FIRMWARE/mandel_C.c](FIRMWARE/mandel_C.c). It uses [ANSI colors](https://stackoverflow.com/questions/4842424/list-of-ansi-color-escape-sequences) to display low-resolution "graphics" in the terminal. There is also [FIRMWARE/riscv_logo.c](FIRMWARE/riscv_logo.c) that displays a spinning RISC-V logo (in a 90-ish demoscene style !).

**Try this** Compile `sieve.c` (`cd FIRMWARE; make sieve.bram.hex`) and test it in simulation (`./run_verilator.sh step20.v`) and on device (`BOARDS/run_xxx.sh step20.v; ./terminal.sh`). Try the other programs. Write your own programs (if you do not have an idea, try for instance cellular automata, Life ...).

Note: the Verilator framework can directly load ELF executables in simulation (no need to regenerate `firmware.hex`). You can generate all demo programs: `cd FIRMWARE; make hello.bram.elf mandelbrot.bram.elf mandel_C.bram.elf riscv_logo.bram.elf; cd ..`, then run the one that you want using `./run_verilator.sh step20.v FIRMWARE/mandel_C.bram.elf` or `./obj_dir/VSOC FIRMWARE/mandel_C.bram.elf`.

Now you can see that your processor is not just a toy, it is a real RISC-V processor on which you can run programs produced by standard tools !

_Note_ on the IceStick, we only have `6kB` of RAM, so only tiny programs will fit. If the compiled program is larger than `6kB` then you will get an error.
A more problematic case is a program that nearly fills the whole BRAM: then we have nearly no space for the stack, and the stack will overwrite the rest, putting the CPU in an invalid state, probably frozen. This situation is difficult to understand / to debug when you encounter it, so `firmware_words` displays a big warning message whenever the generated code fills more than 95% of the BRAM.

## Step 22: Storing data: can I have more than 6 kB of memory ?

_and some optimizations in the processor_

![](IceStick_SPIFLASH.jpg)

On the IceStick, there are only 8 blocks of 1 kB of BRAM, and since we need to use two of them for the registers, this leaves only 6 kB of RAM for our programs. It is sufficient for small programs like Mandelbrot or little graphic demos, but you will very soon reach the limit.

The IceStick has a little chip (see figure) with 4 MB of FLASH memory (other boards have a similar chip). When you synthesize a design, it is stored in this FLASH memory. On startup, the FPGA loads its configuration from this chip. The nice thing is that the FPGA configuration takes no more than a few tens of kilobytes, this leaves us a lot of space to store our own data. But we will need to create some additional hardware to communicate with this chip.

As you can see on the figure, this chip only has 8 legs, how can we address 4 MB of data using 8 pins only ? In fact, this chip uses a _serial protocol_ (SPI). To access data, one sends the address to be read on a pin, one bit at a time, then the chip sends the data back on another pin, one bit at a time.

If you want to learn more about it, my notes about SPI flash are [here](https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/TUTORIALS/spi_flash.md) and the VERILOG implementation is in [spi_flash.v](spi_flash.v). It supports different protocols, depending on the number of pins used and whether pins are bidirectional.
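
To make the idea of a serial protocol more concrete, here is a small software model in C, meant to be compiled on your desktop machine. It assumes the simplest SPI flash read command (`0x03` followed by a 24-bit address, most significant bit first), which is not necessarily the mode used by `spi_flash.v`. The `FlashModel` structure and the function names are hypothetical, they only exist in this sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a serial flash chip (for illustration only). */
typedef struct {
    const uint8_t *mem;   /* content of the chip                      */
    size_t size;          /* size of the content, in bytes            */
    uint32_t shift_in;    /* command and address bits received so far */
    int bits_in;          /* number of bits received so far           */
    uint32_t addr;        /* address of the byte being sent back      */
    int bit_out;          /* next bit of that byte to send (7..0)     */
} FlashModel;

/* One clock tick: the chip samples one bit and returns one bit. */
static int flash_clock(FlashModel *f, int mosi) {
    if (f->bits_in < 32) {   /* still receiving command + address */
        f->shift_in = (f->shift_in << 1) | (uint32_t)(mosi & 1);
        if (++f->bits_in == 32) {
            assert((f->shift_in >> 24) == 0x03u);  /* only READ is modelled */
            f->addr = f->shift_in & 0xFFFFFFu;
        }
        return 0;
    }
    int bit = (f->mem[f->addr % f->size] >> f->bit_out) & 1;
    if (--f->bit_out < 0) {  /* byte finished, go to the next one */
        f->bit_out = 7;
        f->addr++;
    }
    return bit;
}

/* Processor side: sends the command and the address one bit at a time,
 * then receives four bytes one bit at a time (little-endian word). */
uint32_t spi_read_word(const uint8_t *mem, size_t size, uint32_t byte_addr) {
    assert(size > 0);
    FlashModel f = { mem, size, 0, 0, 0, 7 };
    uint32_t cmd = (0x03u << 24) | (byte_addr & 0xFFFFFFu);
    for (int i = 31; i >= 0; --i) {
        flash_clock(&f, (int)((cmd >> i) & 1u));
    }
    uint32_t word = 0;
    for (int byte = 0; byte < 4; ++byte) {
        uint32_t b = 0;
        for (int i = 0; i < 8; ++i) {
            b = (b << 1) | (uint32_t)flash_clock(&f, 0);
        }
        word |= b << (8 * byte);
    }
    return word;
}
```

In this model, reading a single 32-bit word takes 64 clock ticks (32 to send the command and the address, 32 to receive the data).
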
The `MappedSPIFlash` module has the following interface:
```verilog
module MappedSPIFlash(
    input wire         clk,
    input wire         rstrb,
    input wire [19:0]  word_address,

    output wire [31:0] rdata,
    output wire        rbusy,

    output wire        CLK,
    output reg         CS_N,
    inout wire [1:0]   IO
);
```

| signal       | description                                                    |
|--------------|----------------------------------------------------------------|
| clk          | system clock                                                   |
| rstrb        | read strobe, goes high whenever processor wants to read a word |
| word_address | address of the word to be read                                 |
| rdata        | data read from memory                                          |
| rbusy        | asserted if busy receiving data                                |
| CLK          | clock pin of the SPI flash chip                                |
| CS_N         | chip select pin of the SPI flash chip, active low              |
| IO           | two bidirectional pins for sending and receiving data          |

Now the idea is to modify our SOC in such a way that some addresses correspond to the SPI flash. First we need to decide how it will be projected into the memory space of our processor. The idea is to use bit 23 of memory addresses to select the SPI Flash. Bit 22 is for IO (LEDs, UART). In addition, for IO, we need to check that bit 23 is zero. And if both bits 23 and 22 are zero, then we are in BRAM. So our memory space is decomposed into four "quadrants" depending on bits 23 and 22, and we use three of them.

Then we have the different signals to discriminate the different zones of our memory:
```verilog
   wire isSPIFlash  = mem_addr[23];
   wire isIO        = mem_addr[23:22] == 2'b01;
   wire isRAM       = mem_addr[23:22] == 2'b00;
```

The `MappedSPIFlash` module is wired as follows:
```verilog
   wire [31:0] SPIFlash_rdata;
   wire SPIFlash_rbusy;
   MappedSPIFlash SPIFlash(
      .clk(clk),
      .word_address(mem_wordaddr),
      .rdata(SPIFlash_rdata),
      .rstrb(isSPIFlash & mem_rstrb),
      .rbusy(SPIFlash_rbusy),
      .CLK(SPIFLASH_CLK),
      .CS_N(SPIFLASH_CS_N),
      .IO(SPIFLASH_IO)
   );
```
(the pins `SPIFLASH_CLK`, `SPIFLASH_CS_N`, `SPIFLASH_IO[0]` and `SPIFLASH_IO[1]` are declared in the constraint file, in the `BOARDS` subdirectory).
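
The same address decoding can be written in C, for instance to check by hand in which zone a given address falls. This is only a sketch: the `Zone` type and the function names are made up for this example, and `IO_BASE` is the value used in `blinker.S`:

```c
#include <assert.h>
#include <stdint.h>

#define IO_BASE 0x400000u   /* bit 22 set, bit 23 clear */

typedef enum { ZONE_RAM, ZONE_IO, ZONE_SPI_FLASH } Zone;

/* Mirrors the isSPIFlash / isIO / isRAM wires: only bits 23 and 22
 * of the address are examined. */
Zone decode_zone(uint32_t mem_addr) {
    uint32_t bit23 = (mem_addr >> 23) & 1u;
    uint32_t bit22 = (mem_addr >> 22) & 1u;
    if (bit23) {
        return ZONE_SPI_FLASH;   /* bit 23 alone selects the flash */
    }
    if (bit22) {
        return ZONE_IO;          /* bits 23:22 == 01 */
    }
    return ZONE_RAM;             /* bits 23:22 == 00 */
}

/* Address of an IO register, given its offset in the IO page. */
uint32_t io_address(uint32_t offset) {
    assert(offset < (1u << 22));   /* must not reach bits 22 and 23 */
    return IO_BASE | offset;
}
```

For instance, `decode_zone(io_address(4))` (the LEDs register of `blinker.S`) gives `ZONE_IO`, and `decode_zone(1u << 23)` gives `ZONE_SPI_FLASH`.
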
The data sent to the processor goes through a three-way mux:
```verilog
   assign mem_rdata = isRAM      ? RAM_rdata :
                      isSPIFlash ? SPIFlash_rdata :
                                   IO_rdata ;
```

OK, now our processor can automatically trigger a SPI flash read by accessing memory with bit 23 set in the address, but how does it know that data is ready ? (remember, data arrives one bit at a time). There is this `SPIFlash_rbusy` signal that goes high whenever `MappedSPIFlash` is busy receiving some data, we need to take it into account in our processor's state machine. We add a new input signal `mem_rbusy` to our processor, and modify the state machine as follows:
```verilog
   ...
   WAIT_DATA: begin
      if(!mem_rbusy) begin
         state <= FETCH_INSTR;
      end
   end
   ...
```

Then, in the SOC, this signal is wired to `SPIFlash_rbusy`:
```verilog
   wire mem_rbusy;
   ...
   Processor CPU(
     ...
     .mem_rbusy(mem_rbusy),
     ...
   );
   ...
   assign mem_rbusy = SPIFlash_rbusy;
```

By the way, since we are revisiting the state machine, there is something we can do. Remember this portion of the state machine, don't you think we could go faster ?
```verilog
   WAIT_INSTR: begin
      instr <= mem_rdata;
      state <= FETCH_REGS;
   end
   FETCH_REGS: begin
      rs1 <= RegisterBank[rs1Id];
      rs2 <= RegisterBank[rs2Id];
      state <= EXECUTE;
   end
```

Yes, `rs1Id` and `rs2Id` are simply 5 wires (each) drawn from `instr`, so we can get them from `mem_rdata` directly, and fetch the registers in the `WAIT_INSTR` state, as follows:
```verilog
   WAIT_INSTR: begin
      instr <= mem_rdata;
      rs1 <= RegisterBank[mem_rdata[19:15]];
      rs2 <= RegisterBank[mem_rdata[24:20]];
      state <= EXECUTE;
   end
```
Doing so we gain one cycle per instruction, and it is an easy win !

Oh, and one more thing, why do we need a `LOAD` and a `STORE` state, couldn't we initiate memory transfers in the `EXECUTE` state ?
Yes we can, so we need to change the write mask and read strobes accordingly, like that:
```verilog
   assign mem_rstrb = (state == FETCH_INSTR || (state == EXECUTE & isLoad));
   assign mem_wmask = {4{(state == EXECUTE) & isStore}} & STORE_wmask;
```

Then the state machine has 4 states only !
```verilog
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam EXECUTE     = 2;
   localparam WAIT_DATA   = 3;
   reg [1:0] state = FETCH_INSTR;
   always @(posedge clk) begin
      if(!resetn) begin
         PC    <= 0;
         state <= FETCH_INSTR;
      end else begin
         if(writeBackEn && rdId != 0) begin
            RegisterBank[rdId] <= writeBackData;
         end
         case(state)
           FETCH_INSTR: begin
              state <= WAIT_INSTR;
           end
           WAIT_INSTR: begin
              instr <= mem_rdata;
              rs1 <= RegisterBank[mem_rdata[19:15]];
              rs2 <= RegisterBank[mem_rdata[24:20]];
              state <= EXECUTE;
           end
           EXECUTE: begin
              if(!isSYSTEM) begin
                 PC <= nextPC;
              end
              state <= isLoad ? WAIT_DATA : FETCH_INSTR;
           end
           WAIT_DATA: begin
              if(!mem_rbusy) begin
                 state <= FETCH_INSTR;
              end
           end
         endcase
      end
   end
```

There are several other things that we can optimize. First thing, you may have noticed that the two LSBs of the instructions are always `2'b11` in RV32I, so we do not need to load them:
```verilog
   reg [31:2] instr;
   ...
   instr <= mem_rdata[31:2];
   ...
   wire isALUreg = (instr[6:2] == 5'b01100);
   ...
```

Something else: we are doing all address computations with 32 bits, whereas our address space has 24 bits only, so we can save significant resources there:
```verilog
   localparam ADDR_WIDTH=24;
   wire [ADDR_WIDTH-1:0] PCplusImm = PC + ( instr[3] ? Jimm[31:0] :
                                            instr[4] ? Uimm[31:0] :
                                                       Bimm[31:0] );
   wire [ADDR_WIDTH-1:0] PCplus4 = PC+4;

   wire [ADDR_WIDTH-1:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm :
                                  isJALR ? {aluPlus[31:1],1'b0} :
                                  PCplus4;

   wire [ADDR_WIDTH-1:0] loadstore_addr = rs1 + (isStore ? Simm : Iimm);
```

The up-to-date verilog file is available in [step22.v](step22.v).
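
To count how many cycles an instruction takes with this new state machine, one can mimic it in a few lines of C. This is a simplified model written for this explanation (the function `cycles_for_instruction` and its `busy_cycles` parameter are hypothetical), that assumes instructions are fetched from BRAM, so that the fetch never has to wait:

```c
#include <assert.h>

typedef enum { FETCH_INSTR, WAIT_INSTR, EXECUTE, WAIT_DATA } State;

/* Returns the number of clock cycles spent on one instruction.
 *   is_load:     non-zero if the instruction is a load
 *   busy_cycles: number of cycles during which mem_rbusy stays high
 *                once the processor is in the WAIT_DATA state */
int cycles_for_instruction(int is_load, int busy_cycles) {
    assert(busy_cycles >= 0);
    State state = FETCH_INSTR;
    int cycles = 0;
    for (;;) {
        ++cycles;                          /* one rising edge of clk */
        switch (state) {
        case FETCH_INSTR:
            state = WAIT_INSTR;
            break;
        case WAIT_INSTR:                   /* latches instr, rs1 and rs2 */
            state = EXECUTE;
            break;
        case EXECUTE:
            if (!is_load) {
                return cycles;             /* next state is FETCH_INSTR */
            }
            state = WAIT_DATA;
            break;
        case WAIT_DATA:
            if (busy_cycles == 0) {        /* !mem_rbusy */
                return cycles;
            }
            --busy_cycles;
            break;
        }
    }
}
```

In this model, an instruction that is not a load takes 3 cycles, and a load takes 4 cycles plus the number of cycles during which `mem_rbusy` stays high.
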
Let us now check that we are able to access the SPI flash from our processor, with the following [program](FIRMWARE/read_spiflash.c):
```C
#include "io.h"

#define SPI_FLASH_BASE ((char*)(1 << 23))

int main()  {
   for(int i=0; i<16; ++i) {
      IO_OUT(IO_LEDS,i);
      int lo = (int)SPI_FLASH_BASE[2*i  ];
      int hi = (int)SPI_FLASH_BASE[2*i+1];
      print_hex_digits((hi << 8) | lo,4); // print four hexadecimal digits
      printf(" ");
   }
   printf("\n");
}
```

The SPI flash is mapped in memory space, using addresses with bit 23 set (the first address, that we call `SPI_FLASH_BASE`, is `1 << 23`). Then we access all individual bytes, and display them by grouping them into 16-bit words (for each word, the first byte in memory is the least significant one, because RISC-V follows the little-endian convention). We have a `print_hex_digits()` function in [FIRMWARE/print.c](FIRMWARE/print.c) that does the job (the second argument is the number of hex characters we want to print for each number).

Now compile the program, synthesize the design and send it to the device as follows:
```
$ cd FIRMWARE
$ make read_spiflash.bram.hex
$ cd ..
$ BOARDS/run_icestick.sh step22.v
$ ./terminal.sh
```

... and you see nothing. Why is this so ? The program finished before you started the terminal, so we were not able to see anything, but you can reset the processor, pushing the invisible reset button (mentioned in [step 2](README.md#step-2-slower-blinky)). Each time you push the "button", it will display on the terminal the first 16 words stored in the SPI flash. On an IceStick, you will see something like:
```
00FF FF00 AA7E 7E99 0051 0501 0092 6220 4B01 0072 8290 0000 0011 0101 0000 0000
```

Do you have an idea where these values come from ? Remember why there is this SPI flash chip on your FPGA board: it is where your design is stored. When the FPGA starts, it loads its design from the SPI flash.
The design corresponds to the file `SOC.bin`, that is generated at the end of the `yosys/nextpnr/icepack` pipeline:
- `yosys` transforms your verilog into a "circuit", also called a "netlist"
- then `nextpnr` maps the gates of this circuit to the logical elements of the FPGA,
- and finally `icepack` converts the result into a "binary stream" directly understood by the FPGA.

Let us examine the first 16 words of the binary stream:
```
$ od -x -N 32 SOC.bin
```
Then you'll see something like:
```
0000000 00ff ff00 aa7e 7e99 0051 0501 0092 6220
0000020 4b01 0072 8290 0000 0011 0101 0000 0000
0000040
```
and this corresponds to what we have just seen on the terminal, read from the SPI flash chip. So our CPU can read its own FPGA representation from the SPI flash, like a biologist sequencing his own DNA !

While it has a nice and intriguing recursive flavor, it is probably of very little practical use, but let us take a deeper look at it: the `SOC.bin` file is not very large:
```
$ ls -al SOC.bin
-rw-rw-r-- 1 blevy blevy 32220 Jan  7 07:31 SOC.bin
```
It weighs only `32KB` or so, and our SPI flash chip has capacity for `4MB`, so there is plenty of room for us ! The only thing we need to take care of is not overwriting the FPGA configuration (in other words, always start further away than the size of `SOC.bin`). So we will use a `1MB` offset for storing our data (you will say we are wasting a lot of space between `32KB` and `1MB` but we shall use that space for something else in subsequent steps of this tutorial).

**Try this** Create a text file `hello.txt`, send it to the FPGA at the `1MB` offset (see below how to do that), write a program that displays the stored file. To know where to stop, you may need either to choose a termination character or to store the length of the file in front of its content.
For ICE40 boards (IceStick, IceBreaker, ...), use:
```
$ iceprog -o 1M hello.txt
```

For ECP5 boards (ULX3S), use:
```
$ cp hello.txt hello.img
$ ujprog -j flash -f 1048576 hello.img
```
(using latest version of `ujprog` compiled from [https://github.com/kost/fujprog](https://github.com/kost/fujprog)).

![](ST_NICCC_tty.png)

OK, so now we are ready to use the new storage that we have for more interesting things. What we will do is display an animation on the terminal. The animation is a demo from the 90's, that streams polygon data to a software polygon renderer. Polygon data is a 640 kB binary file, available from `learn-fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat` (see other files in the same directory for more information about the file format).

First thing to do is writing the file to the SPI flash, at a 1 MB offset.

For ICE40-based boards (IceStick, IceBreaker), use:
```
$ iceprog -o 1M learn-fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat
```

For ECP5 boards (ULX3S), use:
```
$ cp learn-fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat scene1.img
$ ujprog -j flash -f 1048576 scene1.img
```
(using latest version of `ujprog` compiled from [https://github.com/kost/fujprog](https://github.com/kost/fujprog)).

Now you can compile the program:
```
$ cd FIRMWARE
$ make ST_NICCC.bram.hex
$ cd ..
```
and send the design and the program to the device:
```
$ BOARDS/run_xxx.sh step22.v
$ ./terminal.sh
```

**Try this** Store an image in SPI Flash (in a format that is easy to read), and write a program to display it. You can use `printf("\033[48;2;%d;%d;%dm ",R,G,B);` to send a pixel (where `R`,`G`,`B` are numbers between 0 and 255), and `printf("\033[48;2;0;0;0m\n");` after each scanline.
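
As a starting point, here is a minimal sketch of such a display routine, written for a desktop C library (it uses `snprintf()`, that the small `printf()` replacement in `print.c` may not provide). It assumes that the image is stored as packed `R`,`G`,`B` bytes, one scanline after the other, which is only one possible "easy to read" format. The names `format_pixel` and `display_image` are made up for this example:

```c
#include <assert.h>
#include <stdio.h>

/* Writes into `out` the escape sequence that displays one "pixel":
 * it sets the background color, then prints a space.
 * Returns the number of characters written (not counting the final 0). */
int format_pixel(char *out, size_t out_size, int r, int g, int b) {
    assert(r >= 0 && r <= 255);
    assert(g >= 0 && g <= 255);
    assert(b >= 0 && b <= 255);
    return snprintf(out, out_size, "\033[48;2;%d;%d;%dm ", r, g, b);
}

/* Displays an image stored as width*height packed RGB triplets
 * (assumed format, see the text above). */
void display_image(const unsigned char *rgb, int width, int height) {
    char pixel[32];   /* the longest sequence has 20 characters */
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const unsigned char *p = rgb + 3 * (y * width + x);
            format_pixel(pixel, sizeof pixel, p[0], p[1], p[2]);
            fputs(pixel, stdout);
        }
        /* back to a black background, then next scanline */
        fputs("\033[48;2;0;0;0m\n", stdout);
    }
}
```

On the device, the `rgb` pointer would simply be the address where the image was written in the SPI flash, as seen from the processor.
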
## Step 23: running programs from SPI Flash, first steps

With what we have done in the previous step, we are now able to load data from the SPI flash, and we have ample space for all our data, but we still have only 6 kB that is shared between our code and variables, it is not much ! It would be great to be able to use the SPI flash to store our code, and execute it directly from there. We were able to write nice demos that fit in 6 kB, imagine what you could do with 2 MB for code, and the entire 6 kB available for your variables !

To be able to load code from the SPI flash, the only thing we need to change is staying in the `WAIT_INSTR` state until `mem_rbusy` is zero, hence we just need to test `mem_rbusy` before changing `state` to `EXECUTE`:
```verilog
   WAIT_INSTR: begin
      instr <= mem_rdata[31:2];
      rs1 <= RegisterBank[mem_rdata[19:15]];
      rs2 <= RegisterBank[mem_rdata[24:20]];
      if(!mem_rbusy) begin
         state <= EXECUTE;
      end
   end
```

and we initialize the BRAM with the following program, that jumps to address `0x00820000`:
```verilog
   initial begin
      LI(a0,32'h00820000);
      JR(a0);
   end
```

This address corresponds to the address where the SPI flash is projected into the address space of our CPU (`0x00800000` = `1 << 23`) plus an offset of 128 kB (`0x20000`). This offset of 128 kB is necessary because, remember, we share the SPI Flash with the FPGA that stores its configuration in it !

OK, that's mostly it for the hardware part. Let us see now if we can execute code from there. To do that, we will need a new linker script ([FIRMWARE/spiflash0.ld](FIRMWARE/spiflash0.ld)):
```
MEMORY
{
   FLASH (RX)  : ORIGIN = 0x00820000, LENGTH = 0x100000 /* 1 MB in flash */
}
SECTIONS
{
    everything :
    {
        . = ALIGN(4);
        start.o (.text)
        *(.*)
    } >FLASH
}
```

It is the same thing as before, but we tell the linker to put everything in flash memory (for now, we will see later how it works for global variables).
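
The relation between an offset in the SPI flash chip and the address seen by the CPU is easy to get wrong, so here is a small sketch in C that spells it out. The macro and function names are hypothetical, and the sketch assumes 24-bit addresses, with bit 23 selecting the flash:

```c
#include <assert.h>
#include <stdint.h>

#define SPI_FLASH_BASE    (1u << 23)   /* 0x00800000: bit 23 selects the flash */
#define SPI_FLASH_WINDOW  (1u << 23)   /* bits 22..0 are left for the offset   */
#define CODE_OFFSET       0x20000u     /* 128 kB, after the FPGA configuration */

/* Address seen by the CPU for a byte stored at `offset` in the flash. */
uint32_t flash_offset_to_cpu_addr(uint32_t offset) {
    assert(offset < SPI_FLASH_WINDOW);
    return SPI_FLASH_BASE + offset;
}

/* Offset in the flash for an address seen by the CPU. */
uint32_t cpu_addr_to_flash_offset(uint32_t addr) {
    assert((addr & SPI_FLASH_BASE) != 0);   /* must be a flash address */
    return addr & (SPI_FLASH_WINDOW - 1u);
}
```

With these definitions, `flash_offset_to_cpu_addr(CODE_OFFSET)` gives `0x00820000`, the address used in the linker script.
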
Let us test this linker script with a program that does not write to global variables, for instance [FIRMWARE/hello.S](FIRMWARE/hello.S). To link it using our new linker script, we do:
```
$ riscv64-unknown-elf-ld -T spiflash0.ld -m elf32lriscv -nostdlib -norelax hello.o putchar.o -o hello.spiflash0.elf
```
But since it is tedious to type, it is automated by the Makefile:
```
$ make hello.spiflash0.elf
```

Now you need to convert the ELF executable into a flat binary:
```
$ riscv64-unknown-elf-objcopy hello.spiflash0.elf hello.spiflash0.bin -O binary
```
or with our Makefile:
```
$ make hello.spiflash0.bin
```
and send it to the SPI flash at offset 128k:
```
$ iceprog -o 128k hello.spiflash0.bin
```
or with our Makefile:
```
$ make hello.spiflash0.prog
```
and then:
```
$ ./terminal.sh
```

## Step 24: running programs from SPI Flash, a better linker script

Before starting, let us make a little change in our core: when pushing the reset button, it jumps to address 0, which is initialized as a jump to flash memory, but after executing our program, it is possible (and highly probable) that the RAM will have been used for something else, and no longer has the jump-to-flash instruction. To fix this, one can make the CPU jump to flash memory each time reset goes low:
```verilog
   if(!resetn) begin
      PC    <= 32'h00820000;
      state <= WAIT_DATA;
   end
```

Note that `state` is set to `WAIT_DATA`, so that it waits for `mem_rbusy` to go low before doing anything else.

OK, so now we have a large quantity of flash memory in which we can install the code and run it from there. We can also install read-only variables in there, like the string `.asciz "Hello, world !\n"` in the previous example.

And what about local variables ? They are allocated on the stack, that resides in the 6 kB of RAM that we have, so it will work. How does it know where the stack is ? Remember, we have written [FIRMWARE/start.S](FIRMWARE/start.S), that initializes `sp` at the end of the RAM (`0x1800`) and it suffices.
But how does it work for a program like that ?
```C
int x = 3;
void main() {
   x = x + 1;
   printf("%d\n",x);
}
```

The global variable `x` has an initial value that needs to be stored somewhere, so we need to put it in flash memory, but we are modifying it afterwards, so we need to put it in RAM, how is it possible ? In fact, what we need is a mechanism for storing all the initial values of the (initialized) global variables in flash memory and copying them to RAM on startup. To do that, we will need a new linker script (that indicates where to put the variables and where to put their initial values) and a new `start.S` (that copies the initial values to the variables). Let us see how to do that.

When you compile C code, the compiler inserts directives to indicate where the different things go (sections). To take a look, compile one of our C programs into an object file and inspect it:
```
$ cd FIRMWARE
$ make ST_NICCC.o
$ readelf -S ST_NICCC.o
```
It will show you the different sections that are present in the object file.

| section     | description        |
|-------------|--------------------|
| text        | executable code    |
| bss, sbss   | uninitialized data |
| data, sdata | initialized data   |
| rodata      | read-only data     |

The section name (bss) for uninitialized data has a historical reason that dates back to the 60's (BSS: Block Started by Symbol is a pseudo-instruction of an assembler for the IBM 704). Uninitialized and initialized data sections come in two flavors: sbss and sdata are for small uninitialized (resp. initialized) data.

In `readelf` output, there is also a `type` field. `PROGBITS` means that some data needs to be loaded from the file (for the `text`, `data` and `rodata` sections). `NOBITS` means that no data should be loaded (for `bss`). Then the `Addr` field indicates where the section will be mapped into memory (for a `.o` file, it is always 0, but it is useful for a linked elf executable, you can check using `readelf`).
Then the `Offs` field indicates the offset for the section's data in the `.o` file, and the `Size` field the number of bytes in the section.

So what we have to do is writing a linker script that will say the following things:
- `text` sections go to the flash memory
- `bss` sections go to BRAM
- `data` sections go to BRAM, but have their initial values stored in the flash memory

For `text` and `bss`, we already know how to do it. For `data`, linker scripts can specify a LMA (Load Memory Address), that indicates where initial values need to be stored. In our linker script, we will have something like:
```
MEMORY {
   FLASH (rx)  : ORIGIN = 0x00820000, LENGTH = 0x100000
   RAM   (rwx) : ORIGIN = 0x00000000, LENGTH = 0x1800
}
SECTIONS {

    .data: AT(address_in_spi_flash) {
        *(.data*)
        *(.sdata*)
    } > RAM

    .text : {
        start_spiflash1.o(.text)
        *(.text*)
        *(.rodata*)
        *(.srodata*)
    } >FLASH

    .bss : {
        *(.bss*)
        *(.sbss*)
    } >RAM
}
```

Each entry indicates how to map sections read from object files to sections in the executable (`.data`, `.text` and `.bss`), and how to map these sections to the flash memory and to the BRAM. For each entry, some pattern matching rules indicate which sections from the object files are concerned.

For the `.text` section, we make sure that the first section is the text section of `start_spiflash1.o`, because our processor jumps there on reset. Note also that we put the read-only data (`.rodata` and `.srodata`) into the flash.

For the `.data` section, the `AT` keyword indicates the LMA (Load Memory Address) where the linker will put the initial values (an address in SPI flash), and whenever a symbol in a `data` or `sdata` section is referenced, the linker will use its address in RAM.

But a question remains: how does the system know that it should copy initialization data from the flash into BRAM ? How does it know at which address ? How can we initialize uninitialized data (BSS) to zero ?
In fact we need to do it by hand, in the startup code `start_spiflash1.S`, which looks like this:

```asm
.equ IO_BASE, 0x400000
.text
.global _start
.type _start, @function

_start:
.option push
.option norelax
     li gp,IO_BASE
.option pop

     li sp,0x1800

# zero-init bss section:
     la a0, _sbss
     la a1, _ebss
     bge a0, a1, end_init_bss
loop_init_bss:
     sw zero, 0(a0)
     addi a0, a0, 4
     blt a0, a1, loop_init_bss
end_init_bss:

# copy data section from SPI Flash to BRAM:
     la a0, _sidata
     la a1, _sdata
     la a2, _edata
     bge a1, a2, end_init_data
loop_init_data:
     lw a3, 0(a0)
     sw a3, 0(a1)
     addi a0, a0, 4
     addi a1, a1, 4
     blt a1, a2, loop_init_data
end_init_data:

     call main
     ebreak
```

- the first thing that we do is initialize the stack pointer `sp` and the global pointer `gp` (with the IO page address in our case);
- the first loop clears the memory between `_sbss` and `_ebss`;
- the second loop copies data from `_sidata` to `_sdata` ... `_edata`;
- finally we call `main`.

... but wait a minute, how do we know the values of `_sbss`, `_ebss`, `_sidata`, `_sdata` and `_edata`? In fact, the linker script can generate them for us. Here is what the `.data` section looks like:

```
    .data : AT ( _sidata ) {
        . = ALIGN(4);
        _sdata = .;
        *(.data*)
        *(.sdata*)
        . = ALIGN(4);
        _edata = .;
    } > RAM
```

where `.` denotes the current address. In addition, lines like `. = ALIGN(4);` make sure that addresses remain aligned on 4-byte boundaries, since our initialization loops in `start_spiflash1.S` copy and clear one 32-bit word at a time.

The declaration for the `.text` section looks like:

```
    .text : {
        . = ALIGN(4);
        start_spiflash1.o(.text)
        *(.text*)
        . = ALIGN(4);
        *(.rodata*)
        *(.srodata*)
        _etext = .;
        _sidata = _etext;
    } >FLASH
```

Note that it declares `_sidata` right at the end of the text section, so that the `.data` section can put its initialization data there.

OK, so let us try it with one of our examples:

```
$ cd FIRMWARE
$ make mandel_C.spiflash1.prog
$ cd ..
$ ./terminal.sh
```

Yes, it works, but _wait a minute_, it is significantly slower than before. Can you guess why? Remember that the FLASH memory is a *serial* memory, which means that addresses are sent one bit at a time and the result is also obtained one bit at a time (well, in fact two bits at a time for both in our case). It is much slower than the BRAM, which delivers a 32-bit value in one cycle.

Can we do something? Sure we can! What about putting some critical functions in BRAM? To do that, we can change our linker script as follows (result in [FIRMWARE/spiflash2.ld](FIRMWARE/spiflash2.ld)):

```
    .data_and_fastcode : AT ( _sidata ) {
        . = ALIGN(4);
        _sdata = .;

        /* Initialized data */
        *(.data*)
        *(.sdata*)

        /* integer mul and div */
        */libgcc.a:muldi3.o(.text)
        */libgcc.a:div.o(.text)

        putchar.o(.text)
        print.o(.text)

        /* functions with attribute((section(".fastcode"))) */
        *(.fastcode*)

        . = ALIGN(4);
        _edata = .;
    } > RAM
```

By doing so, we indicate that some specific functions (integer multiply and divide from libgcc, and IO functions) should be put in fast RAM, and that's all we have to do! The linker will put the code for these functions in the same section as the initialization data for initialized variables, and our runtime `start_spiflash1.S` will copy them to RAM together with the initialization data at startup, cool! Let us try it with our example:

```
$ cd FIRMWARE
$ make mandel_C.spiflash2.prog
$ cd ..
$ ./terminal.sh
```

Aaaah, much better!

Note also the line `*(.fastcode*)`: you can put your own functions in BRAM, by indicating that they are in a `fastcode` section. In C, you can do that as follows:

```C
void my_function(my args ...) __attribute((section(".fastcode")));

void my_function(my args ...) {
   ...
}
```

**Try this:** run the `ST_NICCC` demo (`make ST_NICCC.spiflash2.prog`). Then uncomment the line in `ST_NICCC.c` with the definition for `RV32_FASTCODE` and re-run it.
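As a minimal sketch of the `fastcode` section attribute described above, the attribute can be wrapped in a macro so that the same source still builds with a host compiler. The macro name `FASTCODE` and the function `checksum` are hypothetical (they are not taken from the tutorial's firmware), and the `__riscv` guard is only one possible way to keep host builds working:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macro (not from the tutorial's sources):
   when compiling for RISC-V, place the function in a ".fastcode"
   section so that a linker script rule such as *(.fastcode*) can
   map it to BRAM. On any other target it expands to nothing. */
#if defined(__riscv)
#define FASTCODE __attribute__((section(".fastcode")))
#else
#define FASTCODE
#endif

/* The declaration carries the attribute, as in the example above. */
uint32_t checksum(const uint8_t *buf, size_t len) FASTCODE;

/* Hypothetical hot loop: sums the bytes of a buffer. */
uint32_t checksum(const uint8_t *buf, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i) {
        sum += buf[i];
    }
    return sum;
}
```

Callers use `checksum` exactly as before; only its placement in memory changes. On a RISC-V build, running `objdump -h` from your toolchain on the object file should list a `.fastcode` section, which is a quick way to check that the attribute was taken into account.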
![](tinyraytracer_tty.png)

Now we can run larger programs on our device:
- [FIRMWARE/pi.c](FIRMWARE/pi.c) (by Fabrice Bellard, computes the decimals of pi)
- [FIRMWARE/tinyraytracer.c](FIRMWARE/tinyraytracer.c) (by Dmitry Sokolov, raytracing)

Both of them use floating point numbers. For an RV32I core such as ours, floating point operations are done by routines implemented in `libgcc`. As a consequence, executables are larger (`pi` weighs 17 kB and `tinyraytracer` weighs 25 kB) and would have been impossible to run in 6 kB of RAM. The additional memory offered by the SPI FLASH opens up many more possibilities for our device!

At this point, not only does our device run code compiled using standard tools (gcc), but it also runs existing, independently developed code (the mathematical routines in `libgcc`). It is quite exciting to run existing binary code on a processor that you created on your own!

## Next tutorial

[Pipelining](PIPELINE.md)

## Files for all the steps

- [step 1](step1.v): Blinker, too fast, can't see anything
- [step 2](step2.v): Blinker with clockworks
- [step 3](step3.v): Blinker that loads pattern from ROM
- [step 4](step4.v): The instruction decoder
- [step 5](step5.v): The register bank and the state machine
- [step 6](step6.v): The ALU
- [step 7](step7.v): Using the VERILOG assembler
- [step 8](step8.v): Jumps
- [step 9](step9.v): Branches
- [step 10](step10.v): LUI and AUIPC
- [step 11](step11.v): Memory in separate module
- [step 12](step12.v): Size optimization: the Incredible Shrinking Core !
- [step 13](step13.v): Subroutines 1 (standard RISC-V instruction set)
- [step 14](step14.v): Subroutines 2 (using RISC-V pseudo-instructions)
- [step 15](step15.v): Load
- [step 16](step16.v): Store
- [step 17](step17.v): Memory-mapped devices
- [step 18](step18.v): Mandelbrot set
- step 19: Faster simulation with Verilator
- [step 20](step20.v): Using the GNU toolchain to compile assembly programs
- step 21: Using the GNU toolchain to compile C programs
- [step 22](step22.v): More memory ! Using the SPI Flash
- [step 23](step23.v): Running programs from the SPI Flash, first steps
- [step 24](step24.v): Running programs from the SPI Flash, better linker script

_WIP_

- step 25: More devices (LED matrix, OLED screen...)