learnFPGA/LESSON1.md

# From Blinker to RISC-V

This tutorial is a progressive journey from a simple blinky design to a RISC-V core.

It works with the following boards:
- IceStick
- IceBreaker
- ULX3S
- ARTY

If you do not have a board, you can run everything in simulation (but
it is not as fun).

## About this tutorial

- it is a progressive introduction, changing only one thing at a time. It
  is a curated version of my logbook when I learnt these notions (2020-2022). I also tryed
  to keep track of all the dead ends I explored and traps that caught me, they
  are often indicated as side remarks and notes;
- I try to keep hardware requirement as minimal as possible. With the tiniest FPGA
  (IceStick Ice40HX1K) you can do the first episode of the tutorial and transform it
  into a fully functional RV32I microcontroller that can execute compiled C code.
- in the end, the obtained processor is not the most efficient, but it is not a toy:
  it can execute any program. To anwser the question you may ask, yes, it
  [runs DOOM](https://github.com/BrunoLevy/learn-fpga/tree/master/LiteX/software/Doom)!
  (but not on an IceStick, you will need a larger FPGA). It works with the help of LiteX that
  has a nice SDRAM controller, because Doom needs some RAM;
- the tutorial is both about hardware and software: you will learn how to compile programs
  in assembly and in C for your core;
- I try to make all example programs fun and interesting while reasonably short. The bundled
  demo programs include:
    - mandelbrot set in assembly and in C
    - rotozoom graphic effect
    - drawing filled polygons
    - raytracing
  These graphic program are all displayed in text mode on the terminal, using ANSI escape
  sequences (yes, this makes BIG pixels). For more fun, it is also possible to use a small OLED display
  instead (will add instructions for that in the future).
- [Episode II](PIPELINE.md) is on pipelining, you will learn there how to transform the basic processor
  obtained at the end of this tutorial into a more efficient pipelined processor with branch prediction.
- [Episode III)(INTERRUPTS.md) is a WIP on interrupts and the priviledged RISC-V ISA.
- This tutorial is in VERILOG. It is currently being ported into other HDLs
    - [Amaranth/nMigen version](https://github.com/bl0x/learn-fpga-amaranth) by @bl0x
    - TODO: Silice version
    - TODO: SpinalHDL version

## Introduction and references on processor design

To understand processor design, the first thing that I have read was
[this answer](https://stackoverflow.com/questions/51592244/implementation-of-simple-microprocessor-using-verilog/51621153#51621153)
on Stackoverflow, that I found inspiring. There is also [this article](http://www.fpgacpu.org/papers/xsoc-series-drafts.pdf) suggested by @mithro.
For a complete course, I highly recommend [this one from the MIT](http://web.mit.edu/6.111/www/f2016/), it also
gives the principles for going much further than what I've done here (pipelines etc...).

For Verilog basics and syntax, I read _Verilog by example by Blaine C. Readler_, it is also short and to the point.

There are two nice things with the Stackoverflow answer:
- it goes to the essential, and keeps nothing else than what's essential
- the taken example is a RISC processor, that shares several similarities with RISC-V
  (except that it has status flags, that RISC-V does not have).

What we learn there is that there will be a _register file_, that stores
the so-called _general-purpose_ registers. By general-purpose, we mean
that each time an instruction reads a register, it can be any of them,
and each time an instruction writes a register, it can be any of them,
unlike the x86 (CISC) that has _specialized_ registers. To implement the
most general instruction (`register <- register OP register`), the
register file will read two registers at each cycle, and optionally
write-back one.

There will be an _ALU_, that will compute an operation on two values.

There will be also a _decoder_, that will generate all required internal signals
from the bit pattern of the current instruction.

If you want to design a RISC-V processor on your own, I recommend you take a deep look at
[the Stackoverflow answer](https://stackoverflow.com/questions/51592244/implementation-of-simple-microprocessor-using-verilog/51621153#51621153),
and do some schematics on your own to have all the general ideas in mind
before going further... or you can choose to directly jump into this tutorial, one step at a time. It will
gently take you from the most trivial Blinky design to a fully functional RISC-V core.

## Prerequisites:

First step is cloning the learn-fpga repository:
```
$ git clone https://github.com/BrunoLevy/learn-fpga.git
```

Before starting, you will need to install the following softwares:
- iverilog/icarus (simulation)
```
  $ sudo apt-get install iverilog
```
- yosys/nextpnr, the toolchain for your board. See [this link](../toolchain.md).

Note that iverilog/icarus is sufficient to run and play with all the
steps of the tutorial, but the experience is not the same. I highly
recommend to run each step on a real device.  The feeling and
excitation of your own processor running some code for the first time
is not of the same magnitude when you are doing simulation !!!

## Step 1: your first blinky

Let us start and create our first blinky ! Our blinky is implemented as a VERILOG module,
connected to inputs and outputs, as follows ([step1.v](step1.v)):
```verilog
   module SOC (
       input  CLK,
       input  RESET,
       output [4:0] LEDS,
       input  RXD,
       output TXD
   );

   reg [4:0] count = 0;
   always @(posedge CLK) begin
      count <= count + 1;
   end
   assign LEDS = count;
   assign TXD  = 1'b0; // not used for now

   endmodule
```
We call it SOC (System On Chip), which is a big name for a blinky, but
that's what our blinky will be morphed into after all the steps of
this tutorial. Our SOC is connected to the following signals:

- `CLK` (input) is the system clock.
- `LEDS` (output) is connected to the 5 LEDs of the board.
- `RESET` (input) is a reset button. You'll say that the IceStick
   has no button, but in fact ... (we'll talk about that
   later)
- `RXD` and `TXD` (input,output) connected to the FTDI chip that emulates
   a serial port through USB. We'll also talk about that
   later.

You can synthesize and send the bitstream to the device as follows:
```
$ BOARDS/run_xxx.sh step1.v
```
where `xxx` corresponds to your board.

The five leds will light on... but they are not blinking. Why is this so ?
In fact they are blinking, but it is too fast for you to distinguish anything.

To see something, it is possible to use simulation. To use simulation, we write
a new VERILOG file [bench_iverilog.v](bench_iverilog.v),
with a module `bench` that encapsulates our `SOC`:
```verilog
module bench();
   reg CLK;
   wire RESET = 0;
   wire [4:0] LEDS;
   reg  RXD = 1'b0;
   wire TXD;

   SOC uut(
     .CLK(CLK),
     .RESET(RESET),
     .LEDS(LEDS),
     .RXD(RXD),
     .TXD(TXD)
   );

   reg[4:0] prev_LEDS = 0;
   initial begin
      CLK = 0;
      forever begin
	 #1 CLK = ~CLK;
	 if(LEDS != prev_LEDS) begin
	    $display("LEDS = %b",LEDS);
	 end
	 prev_LEDS <= LEDS;
      end
   end
endmodule
```
The module `bench` drives all the signals of our `SOC` (called
`uut` here for "unit under test"). The `forever` loop wiggles
the `CLK` signal and displays the status of the LEDs whenever
it changes.

Now we can start the simulation:
```
  $ iverilog -DBENCH -DBOARD_FREQ=10 bench_iverilog.v step1.v
  $ vvp a.out
```
... but that's a lot to remember, so I created a script for that,
you'll prefer to do:
```
  $ ./run.sh step1.v
```

You will see the LEDs counting. Simulation is precious, it lets
you insert "print" statements (`$display`) in your VERILOG code,
which is not directly possible when you run on the device !

To exit the simulation:
```
  <ctrl><c>
  finish
```
_Note: I developped the first version of femtorv completely on device,
 using only the LEDs to debug because I did not know how to
 use simulation, don't do that, it's stupid !_

**Try this** How would you modify `step1.v` to slow it down
sufficiently for one to see the LEDs blinking ?

**Try this** Can you implement a "Knight driver"-like blinking
pattern instead of counting ?

## Step 2: slower blinky

You probably got it right: the blinky can be slowed-down either
by counting on a larger number of bits (and wiring the most
significant bits to the leds), or inserting a "clock divider"
(also called a "gearbox") that counts on a large number
of bits (and driving the counter
with its most significant bit). The second solution is interesting,
because you do not need to modify your design, you just insert
the clock divider between the `CLK` signal of the board and your
design. Then, even on the device you can distinguish what happens
with the LEDs.

To do that, I created a `Clockworks` module in [clockworks.v](clockworks.v),
that contains the gearbox and a mechanism related with the `RESET` signal (that
I'll talk about later). `Clockworks` is implemented as follows:
```verilog
module Clockworks
(
   input  CLK,   // clock pin of the board
   input  RESET, // reset pin of the board
   output clk,   // (optionally divided) clock for the design.
   output resetn // (optionally timed) negative reset for the design (more on this later)
);
   parameter SLOW;
...
   reg [SLOW:0] slow_CLK = 0;
   always @(posedge CLK) begin
      slow_CLK <= slow_CLK + 1;
   end
   assign clk = slow_CLK[SLOW];
...
endmodule
```
This divides clock frequency by `2^SLOW`.

The `Clockworks` module is then inserted
between the `CLK` signal of the board
and the design, using an internal `clk`
signal, as follows, in [step2.v](step2.v):

```verilog
`include "clockworks.v"

module SOC (
    input  CLK,        // system clock
    input  RESET,      // reset button
    output [4:0] LEDS, // system LEDs
    input  RXD,        // UART receive
    output TXD         // UART transmit
);

   wire clk;    // internal clock
   wire resetn; // internal reset signal, goes low on reset

   // A blinker that counts on 5 bits, wired to the 5 LEDs
   reg [4:0] count = 0;
   always @(posedge clk) begin
      count <= !resetn ? 0 : count + 1;
   end

   // Clock gearbox (to let you see what happens)
   // and reset circuitry (to workaround an
   // initialization problem with Ice40)
   Clockworks #(
     .SLOW(21) // Divide clock frequency by 2^21
   )CW(
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );

   assign LEDS = count;
   assign TXD  = 1'b0; // not used for now
endmodule
```
It also handles the `RESET` signal.

Now you can try it on simulation:
```
  $ ./run.sh step2.v
```

As you can see, the counter is now much slower. Try it also on device:
```
  $ BOARDS/run_xxx.sh step2.v
```
Yes, now we can see clearly what happens ! And what about the `RESET`
button ? The IceStick has no button. In fact it has one !

![](IceStick_RESET.jpg)

Press a finger on the circled region of the image (around pin 47).

**Try this** Knight-driver mode, and `RESET` toggles direction.

If you take a look at [clockworks.v](clockworks.v), you will see it can
also create a `PLL`, it is a component that can be used to generate
*faster* clocks. For instance, the IceStick has a 12 MHz system clock,
but the core that we will generate will run at 45 MHz. We will see that
later.

## Step 3: a blinker that loads LEDs patterns from ROM

Now we got all the tools that we need, so let's see how to
transform this blinker into a fully-functional RISC-V
processor. This goal seems to be far far away, but the
processor we will have created at step 16 is not longer
than 200 lines of VERILOG ! I was amazed to discover
that it is that simple to create a processor. OK, let us
go there one step at a time.

We know already that a processor has a memory, and fetches
instructions from there, in a sequential manner most of
the time (except when there are jumps and branches). Let us
start with something similar, but much simpler: a pre-programmed
christmas tinsel, that loads the LEDs pattern from a memory (see
[step3.v](step3.v)). Our tinsel has a memory with the patterns:
```verilog
   reg [4:0] MEM [0:20];
   initial begin
       MEM[0]  = 5'b00000;
       MEM[1]  = 5'b00001;
       MEM[2]  = 5'b00010;
       MEM[3]  = 5'b00100;
       ...
       MEM[19] = 5'b10000;
       MEM[20] = 5'b00000;
   end
```
_Note that what's in the initial block does not generate any circuitry
when synthesized, it is directly translated into the initialization
data for the BRAMs of the FPGA._

We will also have a "program counter" `PC` incremented at each clock, and
a mechanism to fetch `MEM` contents indexed by `PC`:

```verilog
   reg [4:0] PC = 0;
   reg [4:0] leds = 0;

   always @(posedge clk) begin
      leds <= MEM[PC];
      PC <= (!resetn || PC==20) ? 0 : (PC+1);
   end
```
_Note the test `PC==20` to make it cycle._

Now try it with simulation and on device.

**Try this** create several blinking modes, and switch between
  modes using `RESET`.

## The RISC-V instruction set architecture

An important source of information is of course the
[RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf).
There you learn that there are several flavors of the RISC-V standard.
Let us start from the simplest one (RV32I, that is, 32 bits base integer
instruction set). Then we will see how to add things, one thing at a
time. This is a very nice feature of RISC-V, since the instruction set
is _modular_, you can start with a very small self-contained kernel, and
this kernel will be compliant with the norm. This means standard tools
(compiler, assembler, linker) will be able to generate code for this
kernel. Then I started reading Chapter 2 (page 13 to page 30). Seeing
also the table page 130, there are in fact only 11 different
instrutions ! (I say for instance that an AND, an OR, an ADD ... are
the same instruction, the operation is just an additional parameter).
Now we just try to have an idea of the overall picture,
no need to dive into the details for now. Let's take a global look at these
11 instructions:

| instruction | description                          | algo                                 |
|-------------|--------------------------------------|--------------------------------------|
| branch      | conditional jump, 6 variants         | `if(reg OP reg) PC<-PC+imm`          |
| ALU reg     | Three-registers ALU ops, 10 variants | `reg <- reg OP reg`                  |
| ALU imm     | Two-registers ALU ops, 9 variants    | `reg <- reg OP imm`                  |
| load        | Memory-to-register, 5 variants       | `reg <- mem[reg + imm]`              |
| store       | Register-to-memory, 3 variants       | `mem[reg+imm] <- reg`                |
| `LUI`       | load upper immediate                 | `reg <- (im << 12)`                  |
| `AUIPC`     | add upper immediate to PC            | `reg <- PC+(im << 12)`               |
| `JAL`       | jump and link                        | `reg <- PC+4 ; PC <- PC+imm`         |
| `JALR`      | jump and link register               | `reg <- PC+4 ; PC <- reg+imm`        |
| `FENCE`     | memory-ordering for multicores       | (not detailed here, skipped for now) |
| `SYSTEM`    | system calls, breakpoints            | (not detailed here, skipped for now) |

- The 6 branch variants are conditional jumps, that depend on a test
on two registers.

- ALU operations can be of the form `register <- register OP register`
or `register <- register OP immediate`

- Then we have load and store, that can operate
on bytes, on 16 bit values (called half-words) or 32 bit values
(called words). In addition byte and half-word loads can do sign
expansion. The source/target address is obtained by adding an
immediate offset to the content of a register.

- The remaining instructions are more special (one
may skip their description in a first read, you just need to know
that they are used to implement unconditional jumps, function calls,
memory ordering for multicores, system calls and breaks):

    - `LUI` (load upper immediate) is used to load the upper 20 bits of a constant. The lower
bits can then be set using `ADDI` or `ORI`. At first sight it may
seem weird that we need two instructions to load a 32 bit constant
in a register, but in fact it is a smart choice, because all
instructions are 32-bit long.

    - `AUIPC` (add upper immediate to PC) adds a constant to the current program counter and places the
result in a register. It is meant to be used in combination with
`JALR` to reach a 32-bit PC-relative address.

    - `JAL` (jump and link) adds an offset to the PC and stores the address
of the instruction following the jump in a register. It can be used to
implement function calls. `JALR` does the same thing, but adds the
offset to a register.

    - `FENCE` and `SYSTEMS` are used to implement memory ordering in
multicore systems, and system calls/breaks respectively.

To summarize, we got branches (conditional jumps), ALU operations,
load and store, and a couple of special instructions used to implement
unconditional jumps and function calls. There are also two functions
for memory ordering and system calls (but we will ignore these two
ones for now). OK, in fact only 9 instructions then, it seems doable...
At this point, I had not understood everything, so I'll start from what
I think to be the simplest parts (intruction decoder, register file and ALU), then we will
see how things are interconnected, how to implement jumps, branches, and all the instructions.

## Step 4: the instruction decoder

Now the idea is to have a memory with RISC-V instructions in it, load all instructions
sequentially (like in our christmas tinsel), in an `instr` register, and see how to recognize
among the 11 instructions (and light a different LED in function of the recognized instruction). Each
instruction is encoded in a 32-bits word, and we need to decode the different bits of this word to
recognize the instruction and its parameters.

The [RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf)
has all the information that we need summarized in two tables in page 130 (RV32/64G Instruction Set Listings).

Let us take a look at the big table, first thing to notice is that the 7 LSBs tells you which instruction it is
(there are 10 possibilities, we do not count `FENCE` for now).

```verilog
   reg [31:0] instr;
   ...
   wire isALUreg  =  (instr[6:0] == 7'b0110011); // rd <- rs1 OP rs2
   wire isALUimm  =  (instr[6:0] == 7'b0010011); // rd <- rs1 OP Iimm
   wire isBranch  =  (instr[6:0] == 7'b1100011); // if(rs1 OP rs2) PC<-PC+Bimm
   wire isJALR    =  (instr[6:0] == 7'b1100111); // rd <- PC+4; PC<-rs1+Iimm
   wire isJAL     =  (instr[6:0] == 7'b1101111); // rd <- PC+4; PC<-PC+Jimm
   wire isAUIPC   =  (instr[6:0] == 7'b0010111); // rd <- PC + Uimm
   wire isLUI     =  (instr[6:0] == 7'b0110111); // rd <- Uimm
   wire isLoad    =  (instr[6:0] == 7'b0000011); // rd <- mem[rs1+Iimm]
   wire isStore   =  (instr[6:0] == 7'b0100011); // mem[rs1+Simm] <- rs2
   wire isSYSTEM  =  (instr[6:0] == 7'b1110011); // special
```

Besides the instruction type, we need also to decode the arguments of the instruction.
The table on the top distinguishes 6 types of instructions
(`R-type`,`I-type`,`S-type`,`B-type`,`U-type`,`J-type`), depending on the arguments
of the instruction and how they are encoded within the 32 bits of the instruction word.

`R-type` instructions take two source registers `rs1` and `rs2`,
 apply an operation on them and stores the result in a
 third destination register `rd` (`ADD`, `SUB`, `SLL`, `SLT`, `SLTU`, `XOR`,
 `SRL`, `SRA`, `OR`, `AND`).

 Since RISC-V has 32 registers,
 each of `rs1`,`rs2` and `rd` use 5 bits of the instruction
 word. Interestingly, these are the same bits for all
 instruction formats. Hence, "decoding" `rs1`,`rs2`
 and `rd` is just a matter of drawing some wires
 from the instruction word:
```verilog
   wire [4:0] rs1Id = instr[19:15];
   wire [4:0] rs2Id = instr[24:20];
   wire [4:0] rdId  = instr[11:7];
```

 Then, one needs to recognize among the 10 R-type instructions.
 It is done mostly with the `funct3` field, a 3-bits code. With
 a 3-bits code, one can only encode 8 different instructions, hence
 there is also a `funct7` field (7 MSBs of instruction word). Bit
 30 of the instruction word encodes `ADD`/`SUB` and `SRA`/`SRL`
 (arithmetic right shift with sign expansion/logical right shift).
 The instruction decoder has wires for `funct3` and `funct7`:
```verilog
   wire [2:0] funct3 = instr[14:12];
   wire [6:0] funct7 = instr[31:25];
```

`I-type` instructions take one register `rs1`, an immediate value
`Iimm`, applies an operation on them and stores the result in the
destination register `rd` (`ADDI`, `SLTI`, `SLTIU`, `XORI`, `ORI`,
`ANDI`, `SLLI`, `SRLI`, `SRAI`).

_Wait a minute:_ there are 10 R-Type instructions but only 9 I-Type
instructions, why is this so ? If you look carefully, you will see
that there is no `SUBI`, but one can instead use `ADDI` with a
negative immediate value. This is a general rule in RISC-V, if an
existing functionality can be used, do not create a new functionality.

As for R-type instructions, the instruction can be distinguished using
`funct3` and `funct7` (and in `funct7`, only the bit 30 of the instruction
word is used, to distinguish `SRAI`/`SRLI` arithmetic and logical right shifts).

The immediate value is encoded in the 12 MSBs of the instruction word,
hence we will draw additional wires to get it:
```verilog
   wire [31:0] Iimm={{21{instr[31]}}, instr[30:20]};
```

As can be seen, bit 31 of the instruction word is repeated 21 times,
this is "sign expansion" (converts a 12-bits signed quantity into
a 32-bits one).

There are four other instruction formats `S-type` (for Store),
`B-type` (for Branch), `U-type` (for Upper immediates that
are left-shifted by 12), and `J-type` (for Jumps). Each
instruction format has a different way of encoding an immediate
value in the instruction word.

To understand what it means, let's get back to Chapter 2, page 16.
The different instruction types correspond to the way _immediate values_ are encoded in them.

| Instr. type | Description                                    | Immediate value encoding                             |
|-------------|------------------------------------------------|------------------------------------------------------|
| `R-type`    | register-register ALU ops. [more on this here](https://www.youtube.com/watch?v=pVWtI0426mU) | None    |
| `I-type`    | register-immediate integer ALU ops and `JALR`. | 12 bits, sign expansion                              |
| `S-type`    | store                                          | 12 bits, sign expansion                              |
| `B-type`    | branch                                         | 12 bits, sign expansion, upper `[31:1]` (bit 0 is 0) |
| `U-type`    | `LUI`,`AUIPC`                                  | 20 bits, upper `31:12` (bits `[11:0]` are 0)         |
| `J-type`    | `JAL`                                          | 12 bits, sign expansion, upper `[31:1]` (bit 0 is 0) |

Note that `I-type` and `S-type` encode the same type of values (but they are taken from different parts of `instr`).
Same thing for `B-type` and `J-type`.

One can decode the different types of immediates as follows:
```verilog
   wire [31:0] Uimm={    instr[31],   instr[30:12], {12{1'b0}}};
   wire [31:0] Iimm={{21{instr[31]}}, instr[30:20]};
   wire [31:0] Simm={{21{instr[31]}}, instr[30:25],instr[11:7]};
   wire [31:0] Bimm={{20{instr[31]}}, instr[7],instr[30:25],instr[11:8],1'b0};
   wire [31:0] Jimm={{12{instr[31]}}, instr[19:12],instr[20],instr[30:21],1'b0};
```
Note that `Iimm`, `Simm`, `Bimm` and `Jimm` do sign expansion (by copying
bit 31 the required number of times to fill the MSBs).

And that's all for our instruction decoder ! To summarize, the instruction
decoder gets the following information from the instruction word:
- signals isXXX that recognizes among the 11 possible RISC-V instructions
- source and destination registers `rs1`,`rs2` and `rd`
- function codes `funct3` and `funct7`
- the five formats for immediate values (with sign expansion for `Iimm`, `Simm`, `Bimm` and `Jimm`).

Let us now initialize the memory with a few RISC-V instruction and see whether we can recognize them
by lighting a different LED depending on the instruction ([step4.v](step4.v)). To do that, we use
the big table in page 130 of the
[RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf).
It is a bit painful (we will see easier ways later !). Using the `_` character to separate fields of a binary constant is
especially interesting under this circumstance.

```verilog
   initial begin
      // add x1, x0, x0
      //                    rs2   rs1  add  rd  ALUREG
      MEM[0] = 32'b0000000_00000_00000_000_00001_0110011;
      // addi x1, x1, 1
      //             imm         rs1  add  rd   ALUIMM
      MEM[1] = 32'b000000000001_00001_000_00001_0010011;
      ...
      // lw x2,0(x1)
      //             imm         rs1   w   rd   LOAD
      MEM[5] = 32'b000000000000_00001_010_00010_0000011;
      // sw x2,0(x1)
      //             imm   rs2   rs1   w   imm  STORE
      MEM[6] = 32'b000000_00001_00010_010_00000_0100011;
      // ebreak
      //                                        SYSTEM
      MEM[7] = 32'b000000000001_00000_000_00000_1110011;
   end
```

Then we can fetch and recognize the instructions as follows:
```verilog
   always @(posedge clk) begin
      if(!resetn) begin
	 PC <= 0;
      end else if(!isSYSTEM) begin
	 instr <= MEM[PC];
	 PC <= PC+1;
      end
   end
   assign LEDS = isSYSTEM ? 31 : {PC[0],isALUreg,isALUimm,isStore,isLoad};
```
(first led is wired to `PC[0]` so that we will see it blinking even if
 there is the same instruction several times).

As you can see, the program counter is only incremented if instruction
is not `SYSTEM`. For now, the only `SYSTEM` instruction that we support
is `EBREAK`, that halts execution.

In simulation mode, we can in addition display the name of the recognized instruction
and the fields:
```verilog
`ifdef BENCH
   always @(posedge clk) begin
      $display("PC=%0d",PC);
      case (1'b1)
	isALUreg: $display("ALUreg rd=%d rs1=%d rs2=%d funct3=%b",rdId, rs1Id, rs2Id, funct3);
	isALUimm: $display("ALUimm rd=%d rs1=%d imm=%0d funct3=%b",rdId, rs1Id, Iimm, funct3);
	isBranch: $display("BRANCH");
	isJAL:    $display("JAL");
	isJALR:   $display("JALR");
	isAUIPC:  $display("AUIPC");
	isLUI:    $display("LUI");
	isLoad:   $display("LOAD");
	isStore:  $display("STORE");
	isSYSTEM: $display("SYSTEM");
      endcase
   end
`endif
```

**Try this** run `step4.v` in simulation and on the device. Try initializing the memory with
different RISC-V instruction and test whether the decoder recognizes them.

## Sidebar: the elegance of RISC-V

This paragraph may be skipped.
it just contains my own impressions and reflexions on the RISC-V instruction set, inspired by the comments and Q&A in italics in the
[RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf).

At this point, I realized what an _instruction set architecture_ means: it is for sure a specification of _what bit pattern does what_
(Instruction Set) and it is also at the same time driven by how this will be translated into wires (Architecture). An ISA is not
_abstract_, it is _independent_ on an implementation, but it is strongly designed with implementation in mind ! While the
pipeline, branch prediction unit, multiple execution units, caches may differ in different implementations, the instruction decoder
is probably very similar in all implementations.

There were things that seemed really weird to me
in the first place: all these immediate format variants, the fact that immediate values are scrambled in different bits of `instr`,
the `zero` register, and the weird instructions `LUI`,`AUIPC`,`JAL`,`JALR`. When writing the instruction decoder, you better understand the reasons. The
ISA is really smart, and is the result of a long evolution (there were RISC-I, RISC-II, ... before). It seems to me the result of a
_distillation_. Now, in 2020, many things were tested in terms of ISA, and this one seems to have benefited from all the previous
attempts, taking the good choices and avoiding the suboptimal ones.

What is really nice in the ISA is:
- instruction size is fixed. Makes things really easier. _(there are extension with varying instrution length, but at least the core
  instruction set is simple)_;
- `rs1`,`rs2`,`rd` are always encoded by the same bits of `instr`;
- the immediate formats that need to do sign expansion do it from the same bit (`instr[31]`);
- the weird instructions `LUI`,`AUIPC`,`JAL`,`JALR` can be combined to implement higher-level tasks
   (load 32-bit constant in register, jump to arbitrary address, function calls). Their existence is
   justified by the fact it makes the design easier. Then assembly programmer's life is made easier by
   _pseudo-instructions_ `CALL`, `RET`, ... See [risc-v assembly manual](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md), the
   two tables at the end of the page. Same thing for tests/branch instructions obtained by swapping parameters (e.g. `a < b <=> b > a`
   etc...), there are pseudo-instructions that do the job for you.

Put differently, to appreciate the elegance of the RISC-V ISA, imagine
that your mission is to _invent it_. That is, invent both the set of
instructions and the way they are encoded as bit patterns. The constraints are:
- fixed instruction length (32 bits)
- as simple as possible: the ultimate sophistication is simplicity [Leonardo da Vinci] !!
- source and destination registers always encoded at the same position
- whenever there is sign-extension, it should be done from the same bit
- it should be simple to load an arbitrary 32-bits immediate value in a register (but may take several instructions)
- it should be simple to jump to arbitrary memory locations (but may take several instructions)
- it should be simple to implement function calls (but may take several instructions)

Then you understand why there are many different immediate
formats. For instance, consider `JAL`, that does not have a source
register, as compared to `JALR` that has one. Both take an immediate
value, but `JAL` has 5 more bits available to store it, since it does
not need to encode the source register. The slightest available bit is
used to extend the dynamic range of the immediates. This explains both
the multiple immediate formats and the fact that they are assembled
from multiple pieces of `instr`, slaloming between the three fixed
5-bits register encodings, that are there or not depending on the
cases.

Now the rationale behind the weird instructions `LUI`,`AUIPC`,`JAL`
and `JALR` is to give a set of functions that can be combined to load
arbitrary 32-bit values in register, or to jump to arbitrary locations
in memory, or to implement the function call protocol as simply as
possible. Considering the constraints, the taken choices (that seemed
weird to me in the first place) perfectly make sense. In addition,
with the taken choices, the instruction decoder is pretty simple and
has a low logical depth. Besides the 7-bits instruction decoder, it
mostly consists of a set of wires drawn from the bits of `instr`, and
duplication of the sign-extended bit 31 to form the immediate values.

Before moving forward, I'd like to say a word about the `zero` register.
I think it is really a smart move. With it, you do not need a `MOV rd rs`
instruction (just `ADD rd rs zero`), you do not need a `NOP`
instruction (`ADD zero zero zero`), and all the branch variants can
compare with `zero` ! I think that `zero` is a great invention, not as great
as `0`, but really makes the instruction set more compact.

## Step 5: The register bank and the state machine

The register bank is implemented as follows:
```verilog
   reg [31:0] RegisterBank [0:31];
```

Let us take a closer look at what we need to to to execute an instruction.
Consider for instance a stream of R-type instructions. For each instruction,
we need to do the following four things:

- fetch the instruction: `instr <= MEM[PC]`
- fetch the values of `rs1` and `rs2`: `rs1 <= RegisterBank[rs1Id]; rs2 <= RegisterBank[rs2Id]`
   where `rs1` and `rs2` are two registers. We need to do that because `RegisterBank` will be
   synthesized as a block of BRAM, and one needs one cycle to access the content of BRAM.
- compute `rs1` `OP` `rs2` (where `OP` depends on `funct3` and `funct7`)
- store the result in `rd`: `RegisterBank[rdId] <= writeBackData`. This can be done during
  the same cycle as the previous step if `OP` is computed by a combinatorial circuit.

The first three operations are implemented by a state machine,
as follows (see [step5.v](step5.v)):
```verilog
   localparam FETCH_INSTR = 0;
   localparam FETCH_REGS  = 1;
   localparam EXECUTE     = 2;
   reg [1:0] state = FETCH_INSTR;
   always @(posedge clk) begin
	 case(state)
	   FETCH_INSTR: begin
	      instr <= MEM[PC];
	      state <= FETCH_REGS;
	   end
	   FETCH_REGS: begin
	      rs1 <= RegisterBank[rs1Id];
	      rs2 <= RegisterBank[rs2Id];
	      state <= EXECUTE;
	   end
	   EXECUTE: begin
	      PC <= PC + 1;
	      state <= FETCH_INSTR;
	   end
	 endcase
      end
   end
```

The fourth one (register write-back) is implemented in this block:
```verilog
   wire [31:0] writeBackData = ... ;
   wire writeBackEn = ...;
   always @posedge(clk) begin
      if(writeBackEn && rdId != 0) begin
          RegisterBank[rdId] <= writeBackData;
      end
   end
```
Remember that writing to register 0 has no effect (hence the test `rdId != 0`).
The signal `writeBackEn` is asserted whenever `writeBackData` should be written
to register `rdId`.
The data to be written back (`writeBackData`) will be obtained from the ALU,
as explained in the next episode.

**Try this**: run [step5.v](step5.v) in simulation and on the device. You will
see your wannabe CPU's state machine dancing waltz on the LEDs (that display
the current state).

## Step 6: the ALU

Now we can fetch instructions from memory, decode them and read register
values, but our (wannabe) CPU is still unable to do anything. Let us see
how to do actual computations on register's values.

_So, are you going to create an `ALU` module ? And by the way, why did not
 you create a `Decoder` module, and a `RegisterBank` module ?_

My very first design used multiple modules and multiple files, for
a total of 1000 lines of code or so, then Matthias Koch wrote a monolithic
version, that fits in 200 lines of code. Not only it is more compact, but
also it is much easier to understand when you got everything in one place.
**Rule of thumb:** if you have more boxes and wires between the boxes than
circuitry in the boxes, then you have too many boxes !

_But wait a minute, modular design is good, no ?_

Modular design is neither good nor bad, it is useful whenever it makes things
simpler. It is not the case in the present situation. There is no absolute
answer though, it is a matter of taste and style ! In this tutorial, we use
a (mostly) monolithic design.

Now we want to implement two types of instructions:
- Rtype: `rd` <- `rs1` `OP` `rs2`  (recognized by `isALUreg`)
- Itype: `rd` <- `rs1` `OP` `Iimm` (recognized by `isALUimm`)

The ALU takes two inputs `aluIn1` and `aluIn2`, computes
`aluIn1` `OP` `aluIn2` and stores it in `aluOut`:
```verilog
   wire [31:0] aluIn1 = rs1;
   wire [31:0] aluIn2 = isALUreg ? rs2 : Iimm;
   reg [31:0] aluOut;
```
Depending on the instruction type, `aluIn2` is either the value
in the second source register `rs2`, or an immediate in the `Itype`
format (`Immm`). The operation `OP` depends mostly on `funct3`
(and also on `funct7`). Keep a copy of the [RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf) open page 130 on your knees or in another window:

| funct3 | operation                                     |
|--------|-----------------------------------------------|
| 3'b000 | `ADD` or `SUB`                                |
| 3'b001 | left shift                                    |
| 3'b010 | signed comparison (<)                         |
| 3'b011 | unsigned comparison (<)                       |
| 3'b100 | `XOR`                                         |
| 3'b101 | logical right shift or arithmetic right shift |
| 3'b110 | `OR`                                          |
| 3'b111 | `AND`                                         |

- for `ADD`/`SUB`, if its an `ALUreg` operation (Rtype), then one makes the
difference between `ADD` and `SUB` by testing bit 5 of `funct7` (1 for `SUB`).
If it is an `ALUimm` operation (Itype), then it can be only `ADD`. In this
context, one just needs to test bit 5 of `instr` to distinguish between
`ALUreg` (if it is 1) and `ALUimm` (if it is 0).
- for logical or arithmetic right shift, one makes the difference also by testing
bit 5 of `funct7`, 1 for arithmetic shift (with sign expansion) and 0 for
logical shift.
- the shift amount is either the content of `rs2` for `ALUreg` instructions or
  `instr[24:20]` (the same bits as `rs2Id`) for `ALUimm` instructions.

Putting everything together, one gets the following VERILOG code for the ALU:
```verilog
   reg [31:0] aluOut;
   wire [4:0] shamt = isALUreg ? rs2[4:0] : instr[24:20]; // shift amount
   always @(*) begin
      case(funct3)
	3'b000: aluOut = (funct7[5] & instr[5]) ? (aluIn1-aluIn2) : (aluIn1+aluIn2);
	3'b001: aluOut = aluIn1 << shamt;
	3'b010: aluOut = ($signed(aluIn1) < $signed(aluIn2));
	3'b011: aluOut = (aluIn1 < aluIn2);
	3'b100: aluOut = (aluIn1 ^ aluIn2);
	3'b101: aluOut = funct7[5]? ($signed(aluIn1) >>> shamt) : (aluIn1 >> shamt);
	3'b110: aluOut = (aluIn1 | aluIn2);
	3'b111: aluOut = (aluIn1 & aluIn2);
      endcase
   end
```
_Note:_ although it is declared as a `reg`, `aluOut` will be a combinatorial function
(no flipflop generated), because its value is determined in a combinatorial block
(`always @(*)`), and all the configurations are enumerated in the `case` statement.

Register write-back is configured as follows:
```verilog
   assign writeBackData = aluOut;
   assign writeBackEn = (state == EXECUTE && (isALUreg || isALUimm));
```

**Try this** run [step6.v](step6.v) in simulation and on the device. In simulation
it will display the written value and the written register for all register
write-back operation. On the device it will show the 5 LSBs of `x1` on the LEDs.
Then you can try changing the program, and observe the effect on register values.

**You are here !** This is the list of instructions you have to implement,
your wannabe RISC-V core currently supports 20 of them. Next steps: jumps,
then branches, then... the rest. Before then, as you probably have noticed,
translating RISC-V programs into binary (that is, assembling manually) is
extremely painful. Next section gives a much easier solution.

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [ ] 2 | [ ] 6  | [ ] | [ ]   | [ ] 5 | [ ] 3 | [*] 1  |

## Step 7: using the VERILOG assembler

To avoid having to manually translate RISC-V assembly into binary, one can
use the GNU assembler, generate a binary file, translate it into hexadecimal
and use the VERILOG function `readmemh()` to initialize memory with the
content of that file. We will see later how to do that.

But in our case, it would be very convenient to be able to write small
assembly programs directly in the same VERILOG file as our design. In fact,
it is possible to do so, by implementing a RISC-V assembler directly in
VERILOG (using tasks and functions), as done in [riscv_assembly.v](riscv_assembly.v).

In [step7.v](step7.v), memory is initialized with
the same assembly program as in [step6.v](step6.v).
It looks like that now, Much easier to read, no ?
```verilog
   `include "riscv_assembly.v"
   initial begin
      ADD(x0,x0,x0);
      ADD(x1,x0,x0);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADDI(x1,x1,1);
      ADD(x2,x1,x0);
      ADD(x3,x1,x2);
      SRLI(x3,x3,3);
      SLLI(x3,x3,31);
      SRAI(x3,x3,5);
      SRLI(x1,x3,26);
      EBREAK();
   end
```
_Note:_ `riscv_assembly.v` needs to be included from inside the module that
uses assembly.

In this step, we make another modification: in the previous steps, `PC` was
the index of the current instruction. For what follows, we want it to be
the _address_ of the current instruction. Since each instruction is 32-bits
long, it means that:
- to increment `PC`, we do `PC <= PC + 4` (instead of `PC <= PC + 1` as before)
- to fetch the current instruction, we do `instr <= MEM[PC[31:2]];` (we ignore
  the two LSBs of `PC`).

## Step 8: jumps

There are two jump instructions, `JAL` (jump and link), and `JALR` (jump and
link register). By "and link", one means that the current PC can be written
to a register. Hence `JAL` and `JALR` can be used to implement not only
jumps, but also function calls. Here is what the two instructions are
supposed to do:

| instruction     | effect                  |
|-----------------|-------------------------|
| JAL rd,imm      | rd<-PC+4; PC<-PC+Jimm   |
| JALR rd,rs1,imm | rd<-PC+4; PC<-rs1+Iimm  |

To implement these two instructions, we need to make
the following changes to our core. First thing is
register write-back: now value can be `PC+4` instead
of `aluOut` for jump instructions:
```verilog
   assign writeBackData = (isJAL || isJALR) ? (PC + 4) : aluOut;
   assign writeBackEn = (state == EXECUTE &&
			 (isALUreg ||
			  isALUimm ||
			  isJAL    ||
			  isJALR)
			 );
```

We also need to declare a `nextPC` value, that implements the
three possibilities:
```verilog
   wire [31:0] nextPC = isJAL  ? PC+Jimm :
	                isJALR ? rs1+Iimm :
	                PC+4;
```

Then, in the state machine, the line `PC <= PC + 4;` is replaced
with `PC <= nextPC;` and that's all !

We can now implement a simple (infinite) loop to test our new
jump instruction:
```verilog
`include "riscv_assembly.v"
      integer L0_=4;
      initial begin
	 ADD(x1,x0,x0);
      Label(L0_);
	 ADDI(x1,x1,1);
	 JAL(x0,LabelRef(L0_));
	 EBREAK();
	 endASM();
      end
```

The integer `L0_` is a label. Unlike with a real assembler, we
need to specify the value of `L0_` by hand. Here it is easy,
`L0_` is right after the first instruction, hence it corresponds
to the beginning of the RAM (0) plus one 32-bits words, that is, 4.
For longer programs with many labels, you can let the labels uninitialized
(`integer L0_;`) then the first time you run the program, it will compute and display the
values to be used for the labels. It is not super-convenient, but still
much better than assembling by hand / determining the labels by hand.

The `LabelRef()` function computes the label's offset relative to the current program
counter. In addition, in simulation mode, it displays the current address (to be used
to initialize the label), and if the label was already initialized (like here with `L0_=4`)
it checks that the label corresponds to the current address generated by the assembler. If
it is not the case, the `endASM()` statement displays an error message and exits.

_Note 1_: I systematically insert an `EBREAK()` instruction at the end of the program,
here it would not be necessary (we have an infinite loop), but if I change my mind
and exit the loop, then `EBREAK()` is already there.

_Note 2_: the `endASM();` statement checks the validity of all the labels and exits
simulation whenever an invalid label is detected. If you use the RISC-V VERILOG
assembler, systematically run your design in simulation before synthesizing (because
this verification cannot be done at synthesis time).

**Try this** Run the design [step8.v](step8.v) in simulation and on the device.
Yes, after 8 steps, what we have is just another stupid blinky ! But this time,
this blinky is executing a real RISC-V program ! It is not a complete RISC-V core
yet, but it starts to have a strong RISC-V flavor. Be patient, our core will be
soon able to run RISC-V programs that are more interesting than a blinky.

**You are here !**
Still some work to do, but we are making progress.
| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [ ] 6  | [ ] | [ ]   | [ ] 5 | [ ] 3 | [*] 1  |

**Try this** add a couple of instructions before the loop, run in simulation,
fix the label as indicated by the simulator, re-run in simulation, run on device.

## Step 9: Branches

Branches are like jumps, except that they compare two register, and update
`PC` based on the result of the comparison. Another difference is that they
are more limited in the address range they can reach from `PC` (12-bits offset).
There are 6 different branch instructions:

| instruction      | effect                                             |
|------------------|----------------------------------------------------|
| BEQ rs1,rs2,imm  | if(rs1 == rs2) PC <- PC+Bimm                       |
| BNE rs1,rs2,imm  | if(rs1 != rs2) PC <- PC+Bimm                       |
| BLT rs1,rs2,imm  | if(rs1 <  rs2) PC <- PC+Bimm (signed comparison)   |
| BGE rs1,rs2,imm  | if(rs1 >= rs2) PC <- PC+Bimm (signed comparison)   |
| BLTU rs1,rs2,imm | if(rs1 <  rs2) PC <- PC+Bimm (unsigned comparison) |
| BGEU rs1,rs2,imm | if(rs1 >= rs2) PC <- PC+Bimm (unsigned comparison) |

_Wait a minute:_ there is `BLT`, but where is `BGT` ? Always the same
principle in a RISC-V processor: if something can be done with a functionality
that is already there, do not add a new functionality ! In this case,
`BGT rs1,rs2,imm` is equivalent to `BLT rs2,rs1,imm` (just swap the first
two operands). If you use `BGT` in a RISC-V assembly program, it will work
(and the assembler replaces it with `BLT` with swapped operands). `BGT`
is called a "pseudo-instruction". There are many pseudo-instructions to make
RISC-V assembly programmer's life easier (more on this later).

Back to our branch instructions, we will need to add in the ALU some wires
to compute the result of the test, as follows:
```verilog
   reg takeBranch;
   always @(*) begin
      case(funct3)
	3'b000: takeBranch = (rs1 == rs2);
	3'b001: takeBranch = (rs1 != rs2);
	3'b100: takeBranch = ($signed(rs1) < $signed(rs2));
	3'b101: takeBranch = ($signed(rs1) >= $signed(rs2));
	3'b110: takeBranch = (rs1 < rs2);
	3'b111: takeBranch = (rs1 >= rs2);
	default: takeBranch = 1'b0;
      endcase
```
_Note 1_ it is possible to create a much more compact ALU, that uses a much smaller number
of LUTs when synthesized, we sill see that later (for now, our goal is to have a RISC-V
processor that works, we will optimize it later).

_Note 2_ Among the 8 possibilites given by `funct3`, only 6 of them are used by the branch
instructions. It is necessary to have a `default:` statement in the `case`, else the
synthesizer would not be able to keep `takeBranch` as purely combinatorial (and would generate
a latch, which we do not want).

Now the only thing that remains to do for implementing branches is to add a case for
`nextPC`, as follows:
```verilog
   wire [31:0] nextPC = (isBranch && takeBranch) ? PC+Bimm :
   	                isJAL                    ? PC+Jimm :
	                isJALR                   ? rs1+Iimm :
	                PC+4;
```

We are now ready to test a simple loop, that counts from 0 to 31,
displays each iteration on the LEDs (remember, they are wired
to `x1`) and stops:

```c++
`include "riscv_assembly.v"
      integer L0_ = 8;

      initial begin
         ADD(x1,x0,x0);
         ADDI(x2,x0,32);
      Label(L0_);
	 ADDI(x1,x1,1);
         BNE(x1, x2, LabelRef(L0_));
         EBREAK();

	 endASM();
      end
```

**Try this** run [step9.v](step9.v) in simulation and on device. Try modifying the program,
create a "knight driver" blinky with an outer loop and two inner loops (one left to right and
one right to left).

**You are here !**
Wow, we have implemented 28 instructions out of 38 ! Let us continue...

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [ *] 6 | [ ] | [ ]   | [ ] 5 | [ ] 3 | [*] 1  |

## Step 10: LUI and AUIPC

We still have these two weird instructions to implement. What do they do ?
It is rather simple:

| instruction   | effect          |
|---------------|-----------------|
| LUI rd, imm   | rd <= Uimm      |
| AUIPC rd, imm | rd <= PC + Uimm |

And if you look at the `Uimm` format, it reads its MSBs (`imm[31:12]`) from
the immediate encoded in the instructions. The 12 LSBs are set to zero.
These two instructions are super useful: the immediate formats supported by all the
other instructions can only modify the LSBs. Combined with these two
functions, one can load an arbitrary value in a register (but this can
require up to two instructions).

Implementing these two instructions just requires to change `writeBackEn` and
`writeBackData` as follows:
```verilog
   assign writeBackData = (isJAL || isJALR) ? (PC + 4) :
			  (isLUI) ? Uimm :
			  (isAUIPC) ? (PC + Uimm) :
			  aluOut;

   assign writeBackEn = (state == EXECUTE &&
			 (isALUreg ||
			  isALUimm ||
			  isJAL    ||
			  isJALR   ||
			  isLUI    ||
			  isAUIPC)
			 );
```

**You are here !**
Seems that we are nearly there ! 8 instructions to go...

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [ *] 6 | [*] | [*]   | [ ] 5 | [ ] 3 | [*] 1  |


**Try this** run [step10.v](step10.v) in simulation and on the device.

_Argh !!_ On my icestick, it does not fit (requires 1283 LUTs and the
IceStick only has 1280). What can we do ? Remember, we absolutely took
no care about resource consumption, just trying to write a design that
works. In fact, there is _a lot_ of room for improvement in our design,
we will see that later, but before then, let's organize our SOC a
bit better (then we will shrink the processor).

## Step 11: Memory in a separate module

In our previous designs, we got everything in our `SOC` module (memory and
processor). In this step, we will see how to separate them.

First, the `Memory` module:

```verilog
module Memory (
   input             clk,
   input      [31:0] mem_addr,  // address to be read
   output reg [31:0] mem_rdata, // data read from memory
   input   	     mem_rstrb  // goes high when processor wants to read
);
   reg [31:0] MEM [0:255];

`include "riscv_assembly.v"
   integer L0_=8;
   initial begin
                  ADD(x1,x0,x0);
                  ADDI(x2,x0,31);
      Label(L0_); ADDI(x1,x1,1);
                  BNE(x1, x2, LabelRef(L0_));
                  EBREAK();
      endASM();
   end

   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[mem_addr[31:2]];
      end
   end
endmodule
```

In its interface, there is a `clk` signal connected to the clock.
Whenever the processor wants to read in memory, it positions the
address to be read on `mem_addr`, and sets `mem_rstrb` to 1. Then
the `Memory` module returns the data to be read on `mem_rdata`.

Symetrically, the `Processor` module has a `mem_addr` signal (as
`output` this time), a `mem_rdata` signal (as input) and a
`mem_rstrb` signal (as output):

```verilog
module Processor (
    input 	      clk,
    input 	      resetn,
    output     [31:0] mem_addr,
    input      [31:0] mem_rdata,
    output 	      mem_rstrb,
    output reg [31:0] x1
);
...
endmodule
```
(in addition, we have a `x1` signal that contains the contents
of register `x1`, that can be used for visual debugging. We will
plug it to the LEDs).

The state machine has one additional state:
```verilog
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;

   case(state)
     FETCH_INSTR: begin
       state <= WAIT_INSTR;
     end
     WAIT_INSTR: begin
       instr <= mem_rdata;
       state <= FETCH_REGS;
     end
     FETCH_REGS: begin
       rs1 <= RegisterBank[rs1Id];
       rs2 <= RegisterBank[rs2Id];
       state <= EXECUTE;
     end
     EXECUTE: begin
        if(!isSYSTEM) begin
  	   PC <= nextPC;
	end
	state <= FETCH_INSTR;
      end
   endcase
```
_Note_ we will see later how to simplify it and get back to three states.

Now, `mem_addr` and `mem_rstrb` can be wired as follows:
```verilog
   assign mem_addr = PC;
   assign mem_rstrb = (state == FETCH_INSTR);
```

And finally, everything is installed and connected in the `SOC`
```verilog
module SOC (
    input  CLK,        // system clock
    input  RESET,      // reset button
    output [4:0] LEDS, // system LEDs
    input  RXD,        // UART receive
    output TXD         // UART transmit
);
   wire    clk;
   wire    resetn;
   Memory RAM(
      .clk(clk),
      .mem_addr(mem_addr),
      .mem_rdata(mem_rdata),
      .mem_rstrb(mem_rstrb)
   );

   wire [31:0] mem_addr;
   wire [31:0] mem_rdata;
   wire mem_rstrb;
   wire [31:0] x1;
   Processor CPU(
      .clk(clk),
      .resetn(resetn),
      .mem_addr(mem_addr),
      .mem_rdata(mem_rdata),
      .mem_rstrb(mem_rstrb),
      .x1(x1)
   );
   assign LEDS = x1[4:0];

   // Gearbox and reset circuitry.
   Clockworks #(
     .SLOW(19) // Divide clock frequency by 2^19
   ) CW (
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );

   assign TXD  = 1'b0; // not used for now
endmodule
```

Now you can run [step11.v](step11.v) in the simulator. As expected,
it does the same thing as in the previous step (counts on the LEDs
from 0 to 31 and stops). What about running it on the device ?
Wow, even worse, 1341 LUTs (and we only got 1280 of them on the IceStick).
So let us shrink our code to make it fit !

## Step 12: Size optimization: the Incredible Shrinking Core.

_Tribute to "the Incredible Shrinking Man" classic movie_

There are many things we can do for shrinking this core. Let us
first take a look at the ALU. It can compute addition, subtraction,
and comparisons. Can't we reuse the result of subtraction for comparisons ?
Sure we can, but to do that we need to compute a 33 bits subtraction, and
test the sign bit. Matthias Koch (@Mecrisp) explained me this trick, that
is also used in swapforth/J1 (another small RISC core that works on
the IceStick).  The 33 bits subtract is written as follows:
```verilog
   wire [32:0] aluMinus = {1'b0,aluIn1} - {1'b0,aluIn2};
```
if you want to know what `A-B` does in Verilog, it corresponds
to `A+~B+1` (negate all the bits of B before adding, and add 1), it
is how two's complement subtraction works. For instance, take
`4'b0000 - 4'b0001`, the result is `-1`, encoded as `4'b1111`. It is
computed as follows by the formula: `4'b0000 + ~4'b0001 + 1` = `4'b0000 + 4'b1110 + 1`
= `4'b1111`. So we will keep the following expression (we could have kept the
simpler form above, but it is interesting to be aware of what happens under the
scene):
```verilog
   wire [32:0] aluMinus = {1'b1, ~aluIn2} + {1'b0,aluIn1} + 33'b1;
```
Then we can create the wires for the three tests (this saves three 32-bit
adders):
```
   wire        EQ  = (aluMinus[31:0] == 0);
   wire        LTU = aluMinus[32];
   wire        LT  = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];
```

- The first one, `EQ`, goes high when `aluIn1` and `aluIn2` have the same value, or
`aluMinus == 0` (no need to test the 33-rd bit)
- the second one, `LTU`, corresponds to unsigned comparison. It is given by the sign bit of
our 33-bits subtraction.
- for the third one, there are two cases: if the signs differ, then `LT` goes high if
  `aluIn1` is negative, else it is given by the sign bit of our 33-bits subtraction.

Of course, we still need one adder for addition:
```verilog
   wire [31:0] aluPlus = aluIn1 + aluIn2;
```

Then, `aluOut` is computed as follows:
```verilog
   reg [31:0]  aluOut;
   always @(*) begin
      case(funct3)
	3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
	3'b001: aluOut = aluIn1 << shamt;;
	3'b010: aluOut = {31'b0, LT};
	3'b011: aluOut = {31'b0, LTU};
	3'b100: aluOut = (aluIn1 ^ aluIn2);
	3'b101: aluOut = funct7[5]? ($signed(aluIn1) >>> shamt) :
			 ($signed(aluIn1) >> shamt);
	3'b110: aluOut = (aluIn1 | aluIn2);
	3'b111: aluOut = (aluIn1 & aluIn2);
      endcase
   end
```

Let us try on the IceStick. Yes ! 1167 LUTs, it fits ! But it is not a
good reason to stop there, there are still several opportunities to
shrink space. Let us take a look at `takeBranch`, can't we reuse the
`EQ`,`LT`,`LTU` signals we just created ? Sure we can:

```verilog
   reg takeBranch;
   always @(*) begin
      case(funct3)
	3'b000: takeBranch = EQ;
	3'b001: takeBranch = !EQ;
	3'b100: takeBranch = LT;
	3'b101: takeBranch = !LT;
	3'b110: takeBranch = LTU;
	3'b111: takeBranch = !LTU;
	default: takeBranch = 1'b0;
      endcase
   end
```

For this to work, we also need to make sure that `rs2` is routed to the
second ALU input also for branches:

```verilog
   wire [31:0] aluIn2 = isALUreg | isBranch ? rs2 : Iimm;
```

What does it give on the device ? 1094 LUTs, not that bad, but let us continue...
The jump target for `JALR` is `rs1+Iimm`, and we created an adder especially for
that, it is stupid because the ALU already computes that. OK let us reuse it:

```verilog
   wire [31:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm  :
	                isJALR                              ? {aluPlus[31:1],1'b0}:
	                PCplus4;
```

How do we stand now ? 1030 LUTs. And it is not finished: what eats-up the largest
number of LUTs is the shifter, and we have three of them in the ALU (one for left
shifts, one for logical right shifts and one for arithmetic right shifts).
By another sorcerer's trick indicated by by Matthias Koch (@mecrisp), it is
possible to merge the two right shifts, by creating a 33 bits shifter with the
additional bit set to 0 or 1 depending on input's bit31 and on whether it is a
logical shift or an arithmetic shift.
```verilog
   wire [31:0] shifter =
               $signed({instr[30] & aluIn1[31], shifter_in}) >>> aluIn2[4:0];
```

Even better, Matthias told me it is possible to use in fact a single shifter, by flipping the
input and flipping the output if it is a left shift:
```verilog
   wire [31:0] shifter_in = (funct3 == 3'b001) ? flip32(aluIn1) : aluIn1;
   wire [31:0] leftshift = flip32(shifter);
```

The ALU then looks like that:
```verilog
   reg [31:0]  aluOut;
   always @(*) begin
      case(funct3)
	3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
	3'b001: aluOut = leftshift;
	3'b010: aluOut = {31'b0, LT};
	3'b011: aluOut = {31'b0, LTU};
	3'b100: aluOut = (aluIn1 ^ aluIn2);
	3'b101: aluOut = shifter;
	3'b110: aluOut = (aluIn1 | aluIn2);
	3'b111: aluOut = (aluIn1 & aluIn2);
      endcase
   end
```

Where do we stand now ? 887 LUTs my friend !

_Note 1_ well, in fact one can gain even more space with the shifter, by shifting 1 single bit
at each clock. The ALU then becomes a little bit more complicated (multi-cycle), but much
much smaller (Femtorv32-quark uses this trick). We will see that later.

_Note 2_ with a multi-cycle ALU, we could also have a single 33-bits adder, and compute subtractions
in three cycles, by separating the computation of `~aluIn2`, `aluIn1+(~aluIn2)` and `aluIn1+(~aluIn2)+1`.

Before then, another easy win is factoring the adder used for address computation, as follows:
```verilog
   wire [31:0] PCplusImm = PC + ( instr[3] ? Jimm[31:0] :
				  instr[4] ? Uimm[31:0] :
				             Bimm[31:0] );
   wire [31:0] PCplus4 = PC+4;
```

Then these two adders can be used by both `nextPC` and `writeBackData`:
```verilog

   assign writeBackData = (isJAL || isJALR) ? (PCplus4) :
			  (isLUI) ? Uimm :
			  (isAUIPC) ? PCplusImm :
			  aluOut;

   assign writeBackEn = (state == EXECUTE && !isBranch);

   wire [31:0] nextPC = (isBranch && takeBranch || isJAL) ? PC+Imm  :
	                isJALR                   ? {aluPlus[31:1],1'b0} :
	                PCplus;
```

The verdict ? 839 LUTs (we have gained another 50 LUTs or so...). There is still room for
gaining more LUTs (by using a multi-cycle ALU for shifts, and by using a smaller number of
bits for address computation), but we'll keep that for later, since we have now enough room
on the device for the next steps.

## Step 13: subroutines (version 1, using plain RISC-V instructions)

OK, so now we have an (uncomplete) RISC-V processor, a SOC, both fit
on the device. Remember, we are approaching the end, only
8 instructions to go (5 Load variants, 3 Store variants).

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [ *] 6 | [*] | [*]   | [ ] 5 | [ ] 3 | [*] 1  |

Before attacking them, let us learn a bit more on RISC-V assembly, and
function calls. Up to now, we have used a gearbox to slow down the CPU in
such a way we can observe it executing our programs. Could'nt we implement
a `wait` function instead and call it ? Let us see how to do that.

First thing to do is to remove the `#(.SLOW(nnn))` parameter in the `Clockworks`
instanciation:
```verilog
   Clockworks CW(
     .CLK(CLK),
     .RESET(RESET),
     .clk(clk),
     .resetn(resetn)
   );
```
this no longer generates a gearbox and directly wires the `CLK` signal of the board
to the internal `clk` signal used by our design.

OK, so now we need to see two different things:
- how to write a function that waits for some time
- how to call it

_Wait a minute_ you are talking about function calls, but we do not have
`Load` / `Store` instructions. We won't be able to push the return address
on the stack (because we cannot read/write memory, and the stack is in memory !),
so how is it possible ?

There would many possible ways of using RISC-V instructions to implement function
calls. To make sure everybody uses the same convention, there is an
**application binary interface** that defines how to call functions, how to
pass parameters, and which register does what. See
[this document](https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md) for more details.

**Calling a function** In this document, we learn that for calling a function, the return address will
be stored in `x1`. Hence one can call a function using `JAL(x1,offset)` where
`offset` is the (signed) difference between the program counter and the address
of the function to be called. This works provided the offset fits in 20 bits
(Jimm format).
_Note_: for function that are further away, one can use a combination of `AUIPC` and
`JALR` to reach an arbitrary offset.

**Returning from a function** is done by jumping to the address stored in `x1`, which can
be done by `JALR(x0,x1,0)`.

**Function arguments and return value**: The first 6 function arguments
are passed through `x10`..`x16`, and the return value is passed through `x10`
(it overwrites the first function argument).

That's interesting, even though we do not have `Load`/`Store`, we can write programs
with functions, but we cannot write functions that call other functions, because this
requires saving `x1` to the stack (well in fact nothing forbids us from doing that by
saving `x1` in another register but then it would quickly become a mess, so we won't
do that).

One little thing: we have just learnt that in the ABI, `x1` is used to store the
return address of functions. Up to know we have wired it to the LEDs. Since we
are going now to comply with the ABI, we need to chose another register instead.
From now, `x10` will be wired to the LEDs.

OK, so now we have everything we need to write yet another version of the blinky !
Let us chose a `slow_bit` constant, wire a `wait` function that counts to
`2^slow_bit`, and call it to slow-down our blinky:

```verilog
`ifdef BENCH
   localparam slow_bit=15;
`else
   localparam slow_bit=19;
`endif


`include "riscv_assembly.v"
   integer L0_   = 4;
   integer wait_ = 20;
   integer L1_   = 28;

   initial begin
      ADD(x10,x0,x0);
   Label(L0_);
      ADDI(x10,x10,1);
      JAL(x1,LabelRef(wait_)); // call(wait_)
      JAL(zero,LabelRef(L0_)); // jump(l0_)

      EBREAK(); // I keep it systematically
                // here in case I change the program.

   Label(wait_);
      ADDI(x11,x0,1);
      SLLI(x11,x11,slow_bit);
   Label(L1_);
      ADDI(x11,x11,-1);
      BNE(x11,x0,LabelRef(L1_));
      JALR(x0,x1,0);

      endASM();
   end

   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[mem_addr[31:2]];
      end
   end
endmodule
```


Try [step13.v](step13.v) in simulation and on the device.

**Try this** Knight-driver blinky, with one routine for going from left to right,
  another routine for going from right to left, and the wait routine. _Hint_ you
  will need to save `x1` to another register.

## Step 14: subroutines (version 2, using RISC-V ABI and pseudo-instructions)

With the ABI, we have a standard way of writing programs, but there are many
things to remember:
- all RISC-V registers are the same, but with the ABI, we need to use certain
  registers for certain tasks (`x1` for return address, `x10`..`x16` for
  function parameters, etc...);
- calling a function is implemented using `JAL` or `AUIPC` and `JALR`,
  and returning from a function is implemented using `JALR`.

On a CISC processor, there are often special functions for calling
functions (`CALL`) and for returning from a function (`RET`), and registers
are often specialized (function return address, stack pointer, function
parameters). This makes programmer's life easier because there is less
to remember. There is no reason not doing the same for a RISC processor !
Let us pretend that the register are different and give them different names
(or aliases). These names are listed
[here](https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md#general-registers).

| ABI name          | name | usage                                       |
|-------------------|------|---------------------------------------------|
| `zero`            | `x0` | read:0  write:ignored                       |
| `ra`              | `x1` | return address                              |
| `t0`...`t6`       | ...  | temporary registers                         |
| `fp`,`s0`...`s11` | ...  | saved registers, `fp`=`so`: frame pointer   |
| `a0`...`a7`       | ...  | function parameters and return value (`a0`) |
| `sp`              | `x2` | stack pointer                               |
| `gp`              | `x3` | global pointer                              |

Saved registers (`s0`, ... `s11`) are supposed to be left untouched or
saved/restored by functions. You can put your local variables there.
If you write a function, you are supposed to push the ones you use
on the stack and pop them before returning.

For all the other registers, you cannot expect them to be preserved through
function calls.

The global pointer `gp` can be used as a "shortcut" to reach memory areas that are
far away in 1 instruction. We will see that later (once we have `Load` and `Store`).

In our VERILOG assembler [riscv_assembly.v](riscv_assembly.v), we just need to declare
these aliases for register names:
```verilog
   localparam zero = x0;
   localparam ra   = x1;
   localparam sp   = x2;
   localparam gp   = x3;
   ...
   localparam t4   = x29;
   localparam t5   = x30;
   localparam t6   = x31;
```

Besides these names, there are also _pseudo-instructions_ for common tasks, such as:
 | pseudo-instruction    | action                               |
 |-----------------------|--------------------------------------|
 | `LI(rd,imm)`          | loads a 32-bits number in a register |
 | `CALL(offset)`        | calls a function                     |
 | `RET()`               | return from a function               |
 | `MV(rd,rs)`           | equivalent to `ADD(rd,rs,zero)`      |
 | `NOP()`               | equivalent to `ADD(zero,zero,zero)`  |
 | `J(offset)`           | equivalent to `JAL(zero,offset)`     |
 | `BEQZ(rd1,offset)`    | equivalent to `BEQ(rd1,x0,offset)`   |
 | `BNEZ(rd1,offset)`    | equivalent to `BNE(rd1,x0,offset)`   |
 | `BGT(rd1,rd2,offset)` | equivalent to `BLT(rd2,rd1,offset)`  |

If the constant in the [-2048,2047] range, `LI` is implemented using `ADDI(rd,x0,imm)`, else
it uses a combination of `LUI` and `ADDI` (if you want to know how it works, see this [stackoverflow answer](https://stackoverflow.com/questions/50742420/risc-v-build-32-bit-constants-with-lui-and-addi), there are tricky details about sign expansion).

Using ABI register names and pseudo-instructions, our program becomes as follows:

```verilog
   integer L0_   = 4;
   integer wait_ = 24;
   integer L1_   = 32;

   initial begin
      LI(a0,0);
   Label(L0_);
      ADDI(a0,a0,1);
      CALL(LabelRef(wait_));
      J(LabelRef(L0_));

      EBREAK();

   Label(wait_);
      LI(a1,1);
      SLLI(a1,a1,slow_bit);
   Label(L1_);
      ADDI(a1,a1,-1);
      BNEZ(a1,LabelRef(L1_));
      RET();

      endASM();
   end
```
It does not make a huge difference, but in longer programs, it improves legibility by showing
the intent of the programmer (this one is a function, that one is a jump to a label etc...).
Without it, since everything looks like the same, reading a program is more difficult.

It is quite funny: the RISC-V standard has a super-simple instruction set, but programming with
it is not that easy, so the ABI pretends that the instruction set is more complicated, like a
CISC processor, and this makes programmer's life easier. It also ensures that a function written
by a programmer can be called from a function written by another programmer, possibly in a different
language. We will see later how to use GNU assembler and C compiler to compile programs for our CPU.
But before playing with software and toolchains, remember, we still have 8 instructions to implement
in hardware (5 `Load` variants and 3 `Store` variants).

**Try this** invent (or copy it from [somewhere else](https://github.com/riscv-collab/riscv-gcc/blob/5964b5cd72721186ea2195a7be8d40cfe6554023/libgcc/config/riscv/muldi3.S)) a routine to multiply two numbers, test it on various inputs in simulation, and on the device.

## Step 15: Load

Let us see now how to implement load instructions. There are 5 different instructions:

 | Instruction     | Effect                                                       |
 |-----------------|--------------------------------------------------------------|
 | LW(rd,rs1,imm)  | Load word at address (rs1+imm) into rd                       |
 | LBU(rd,rs1,imm) | Load byte at address (rs1+imm) into rd                       |
 | LHU(rd,rs1,imm) | Load half-word at address (rs1+imm) into rd                  |
 | LB(rd,rs1,imm)  | Load byte at address (rs1+imm) into rd then sign extend      |
 | LH(rd,rs1,imm)  | Load half-word at address (rs1+imm) into rd then sign extend |

_Note_ addresses are aligned on word boundaries for `LW` (multiple of 4 bytes) and
halfword boundaries for `LH`,`LHU` (multiple of 2 bytes). It is a good thing, it
makes things much easier for us...

But we still have some work to do ! First, some circuitry that determines the
loaded value (that we will call `LOAD_data`).

As you can see, we got instructions for loading words, half-words and bytes, and
instructions that load half-words and bytes exist in two versions:
- `LBU`,`LHU` that load a byte,halfword in the LSBs of `rd`
- `LB`,`LH` that load a byte,halfword in the LSBs of `rd` then do sign extensin:

For instance, imagine a sign byte with the value `-1`, that is `8'b11111111`,
loading it in a 32-bit register with `LBU` will result in `32'b0000000000000000000000011111111`,
whereas loading it with `LB` will result in `32'b11111111111111111111111111111111`, that is,
the 32-bits version of `-1`.

So we got a "two-dimensional" array of cases (whether we load a byte, halfword, word, and
whether we do sign extension or not). Well, in fact it is even more complicated. Remember,
our memory is structured into words, so when we load a byte, we need to know which one it
is (among 4), and when we load a halfword, we need to know which one it is (among 2). This
can be done by examining the 2 LSBs of the address of the data to be loaded (`rs1 + Iimm`):

```verilog
   wire [31:0] loadstore_addr = rs1 + Iimm;
   wire [15:0] LOAD_halfword =
	       loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];

   wire  [7:0] LOAD_byte =
	       loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];
```

OK, so now we need to select among `mem_rdata` (`LW`), `LOAD_halfword` (`LH`,`LHU`)
and `LOAD_byte` (`LB`,`LBU`). Examining the table in the
[RISC-V reference manual](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf)
page 130, this is determined by the two LSBs of `funct3`:

```verilog
   wire mem_byteAccess     = funct3[1:0] == 2'b00;
   wire mem_halfwordAccess = funct3[1:0] == 2'b01;

   wire [31:0] LOAD_data =
         mem_byteAccess ? LOAD_byte     :
     mem_halfwordAccess ? LOAD_halfword :
                          mem_rdata     ;
```

Now we need to insert sign expansion into this expression. The value to be
written in the MSBs of `rd`, `LOAD_sign`, depends on both whether the
instruction does sign expansion (`LB`,`LH`), characterized by `funct3[2]=0`,
and the MSB of the loaded value:

```verilog
   wire LOAD_sign =
	!funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);

   wire [31:0] LOAD_data =
         mem_byteAccess ? {{24{LOAD_sign}},     LOAD_byte} :
     mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
                          mem_rdata ;
```

Pfiuuuu, it was a bit painful, but in the end it is not too complicated.
My initial design was much more complicated, but Matthias Koch (@mecrisp) simplified
it a lot, resulting in the (reasonably easy to understand) design above.

We are not completely done though, now we need to modify the state machine. It will have
two additional states, `LOAD` and `WAIT_DATA`:

```verilog
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;
   localparam LOAD        = 4;
   localparam WAIT_DATA   = 5;
   reg [2:0] state = FETCH_INSTR;
```

_Note 1_ we could do with a smaller number of states, but for now our goal is to have
something that works and that is as easy to understand as possible. We will see later
how to simplify the state machine.
_Note 2_ do not forget to check that `state` has the required number of bits !
(`reg [2:0] state` instead of `reg [1:0] state` as before !!). Then the new
states are plugged in as follows:

```verilog
     ...
	   EXECUTE: begin
	      if(!isSYSTEM) begin
		 PC <= nextPC;
	      end
	      state <= isLoad ? LOAD : FETCH_INSTR;
	   end
	   LOAD: begin
	      state <= WAIT_DATA;
	   end
	   WAIT_DATA: begin
	      state <= FETCH_INSTR;
	   end

     ...
```

And finally, the signals `mem_addr` (with the address to be read)
and `mem_rstrb` (that goes high whenever the processor wants to read data) are
driven as follows:

```verilog
   assign mem_addr = (state == WAIT_INSTR || state == FETCH_INSTR) ?
		     PC : loadstore_addr ;
   assign mem_rstrb = (state == FETCH_INSTR || state == LOAD);
```

Let us test now our new instructions with the following program:
```verilog
   integer L0_   = 8;
   integer wait_ = 32;
   integer L1_   = 40;

   initial begin
      LI(s0,0);
      LI(s1,16);
   Label(L0_);
      LB(a0,s0,400); // LEDs are plugged on a0 (=x10)
      CALL(LabelRef(wait_));
      ADDI(s0,s0,1);
      BNE(s0,s1, LabelRef(L0_));
      EBREAK();

   Label(wait_);
      LI(t0,1);
      SLLI(t0,t0,slow_bit);
   Label(L1_);
      ADDI(t0,t0,-1);
      BNEZ(t0,LabelRef(L1_));
      RET();

      endASM();

      // Note: index 100 (word address)
      //     corresponds to
      // address 400 (byte address)
      MEM[100] = {8'h4, 8'h3, 8'h2, 8'h1};
      MEM[101] = {8'h8, 8'h7, 8'h6, 8'h5};
      MEM[102] = {8'hc, 8'hb, 8'ha, 8'h9};
      MEM[103] = {8'hff, 8'hf, 8'he, 8'hd};
   end
```
This program initializes some values in four words
at address 400, and loads them in `a10` in a loop.
There is also a delay loop (`wait` function) to let
you see something, just as before.

**Try this** Run the program in simulation and on the device.
Test the other instructions. Do a programmable tinsel as in step 3.

**You are here !** Just three instructions to go and we will be done !

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [*] 6  | [*] | [*]   | [*] 5 | [ ] 3 | [*] 1  |

## Step 16: Store

We are approaching the end, but still some work to do, to implement
the following three instructions:

 | Instruction     | Effect                                  |
 |-----------------|-----------------------------------------|
 | SW(rs2,rs1,imm) | store rs2 at address rs1+imm            |
 | SB(rs2,rs1,imm) | store 8 LSBs of rs2 at address rs1+imm  |
 | SH(rs2,rs1,imm) | store 16 LSBs of rs2 at address rs1+imm |

To do so, we will need to do three different things:
- modify the interface between the processor and the memory in
  such a way that the processor can write to the memory
- the memory is addressed by words. Each write operation will
  modify a word. But `SB` and `SH` need to be able to
  write individual bytes. Besides the word to be written,
  we need to compute which byte of this word should be
  effectively modified in memory (a 4-bits mask)
- the state machine needs to be modified.


The `Memory` module is modified as follows:

``` verilog
module Memory (
   input             clk,
   input      [31:0] mem_addr,
   output reg [31:0] mem_rdata,
   input   	     mem_rstrb,
   input      [31:0] mem_wdata,
   input      [3:0]  mem_wmask
);

   reg [31:0] MEM [0:255];

   initial begin
      ...
   end

   wire [29:0] word_addr = mem_addr[31:2];
   always @(posedge clk) begin
      if(mem_rstrb) begin
         mem_rdata <= MEM[word_addr];
      end
      if(mem_wmask[0]) MEM[word_addr][ 7:0 ] <= mem_wdata[ 7:0 ];
      if(mem_wmask[1]) MEM[word_addr][15:8 ] <= mem_wdata[15:8 ];
      if(mem_wmask[2]) MEM[word_addr][23:16] <= mem_wdata[23:16];
      if(mem_wmask[3]) MEM[word_addr][31:24] <= mem_wdata[31:24];
   end
```


We have two new input signals: `mem_wdata`, a 32-bits signal
with the value to be written, and `mem_wmask` a 4-bits signal
that indicates which byte should be written.

_Note_ you may wonder how it is implemented in practice, in particular
how the masked write to memory is synthesized on the device. BRAMs on
most FPGAs directly support masked writes, through vendor's special
primitives. Yosys has a (super smart) special step called "technology mapping" that
detects some patterns in the source VERILOG file, and instances
the vendor's primitive best adapted to the usage. In fact technology mapping
was used before in our tutorial, to represent the registers bank: at each
cycle we read two registers, `rs1` and `rs2`. In the IceStick, BRAMs can
read a single value at each clock, so to make it possible, yosys automatically
duplicates the register bank. Whenever a value is written to `rd`, it is written to
the two register banks: `bank1[rdId] <- writeBackValue; bank2[rdId] <- writeBackValue;`,
and two different registers can be read at the same cycle, each one in its own
register bank `rs1 <- bank1[rs1Id]; rs2 <- bank2[rs2Id;`. With the magic of Yosys,
you do not have to take care of this, it will automatically select the best
mapping for you (duplicated register bank, single register bank with two read
ports if target supports it, or even array of flipflops with address decoder
for larger FPGAs with many LUTs). In our case, the IceStick has an Ice40HX1K,
that has 8 kB of BRAM, organized in 8 blocks of 1 kB each. Two of them are
used for the (duplicated) register bank, leaving 6 kB of BRAM that we use
to synthesize system RAM.

The `Processor` module is updated accordingly:
```verilog
module Processor (
    input 	      clk,
    input 	      resetn,
    output [31:0]     mem_addr,
    input [31:0]      mem_rdata,
    output 	      mem_rstrb,
    output [31:0]     mem_wdata,
    output [3:0]      mem_wmask,
    output reg [31:0] x10 = 0
);
```

(and everything is connected in the `SOC`).

Let us see now how to compute the word to be written and the mask. The
address where the value should be written is still `rs1 + imm`, but
the format of the immediate value is different between `Load` (`Iimm`)
and `Store` (`Simm`):
```
   wire [31:0] loadstore_addr = rs1 + (isStore ? Simm : Iimm);
```

Now the data to be written depends on whether we write a byte, a halfword
or a word, and for bytes and halfwords, also depends on the 2 LSBs of
the address. Interestingly, we do not need to test whether we write a
byte, a halfword or a word, because the write mask (see lated) will
ignore MSBs for byte and halfword write:
```
   assign mem_wdata[ 7: 0] = rs2[7:0];
   assign mem_wdata[15: 8] = loadstore_addr[0] ? rs2[7:0]  : rs2[15: 8];
   assign mem_wdata[23:16] = loadstore_addr[1] ? rs2[7:0]  : rs2[23:16];
   assign mem_wdata[31:24] = loadstore_addr[0] ? rs2[7:0]  :
			     loadstore_addr[1] ? rs2[15:8] : rs2[31:24];
```

And finally, the 4-bits write mask, that indicate which byte of `mem_wdata`
should be effectively written to memory. It is determined as follows:

| write mask                                   | Instruction
|----------------------------------------------|------------------------------------------|
| `4'b1111`                                    | `SW`                                     |
| `4'b0011` or `4'b1100`                       | `SH`, depending on `loadstore_addr[1]`   |
| `4'b0001`, `4'b0010`, `4'b0100` or `4'b1000` | `SB`, depending on `loadstore_addr[1:0]` |

Deriving the expression is a bit painful. With Matthias Koch we ended up with this one:

```verilog
   wire [3:0] STORE_wmask =
	      mem_byteAccess      ?
	            (loadstore_addr[1] ?
		          (loadstore_addr[0] ? 4'b1000 : 4'b0100) :
		          (loadstore_addr[0] ? 4'b0010 : 4'b0001)
                    ) :
	      mem_halfwordAccess ?
	            (loadstore_addr[1] ? 4'b1100 : 4'b0011) :
              4'b1111;
```

Let us now create additional states in the state machine:
```verilog
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam FETCH_REGS  = 2;
   localparam EXECUTE     = 3;
   localparam LOAD        = 4;
   localparam WAIT_DATA   = 5;
   localparam STORE       = 6;

   ...

   always @(posedge clk) begin
   ...
       case(state)
           ...
	   EXECUTE: begin
	      if(!isSYSTEM) begin
		 PC <= nextPC;
	      end
	      state <= isLoad  ? LOAD  :
		       isStore ? STORE :
		       FETCH_INSTR;
	   LOAD: begin
	      state <= WAIT_DATA;
	   end
	   WAIT_DATA: begin
	      state <= FETCH_INSTR;
	   end
	   STORE: begin
	      state <= FETCH_INSTR;
	   end
	 endcase
      end
   end
```

The signals interfaced with the memory as driven as follows:
```verilog
   assign mem_addr = (state == WAIT_INSTR || state == FETCH_INSTR) ?
		     PC : loadstore_addr ;
   assign mem_rstrb = (state == FETCH_INSTR || state == LOAD);
   assign mem_wmask = {4{(state == STORE)}} & STORE_wmask;
```

And, at last, a little thing: do not write back to register bank if instruction
is a `Store` !
```verilog
   assign writeBackEn = (state==EXECUTE && !isBranch && !isStore && !isLoad) ||
			(state==WAIT_DATA) ;
```
_Note_ The `!isLoad` term that prevents writing `rd` during `EXECUTE` can be removed from the condition,
since `rd` will be overwritten right after during the `WAIT_DATA`. It is there to have something easier
to understand with simulations.

**try this** Run [step16.v](step16.v) in simulation and on the device. It copies 16 bytes from address 400
to address 800, then displays the values of the copied bytes.

**You are here !** Congratulations ! You have finished implementing your first RV32I RISC-V core !

| ALUreg | ALUimm | Jump  | Branch | LUI | AUIPC | Load  | Store | SYSTEM |
|--------|--------|-------|--------|-----|-------|-------|-------|--------|
| [*] 10 | [*] 9  | [*] 2 | [*] 6  | [*] | [*]   | [*] 5 | [*] 3 | [*] 1  |

_But wait a minute_ for sure we have worked a lot to implement a RISC-V core, but all what I can see know
is just something that looks like the stupid blinky at step 1 ! I want to see more !

To do so, we need to let our device communicate with the outside word with more than 5 LEDs.

## Step 17: Memory-mapped device - let's do (much) more than a blinky !

Now the idea is to add devices to our SOC. We already have LEDs, that are plugged to
register `a0` (`x10`). Plugging devices on a register like that is not super elegant, it would
be better to have a special address in memory that is not really actual RAM but that has
a register plugged to the LEDs. With this idea, one can add as many devices as he likes, by
assigning a virtual address to each device. Then the SOC will have address decoding hardware
that routes the data to the right device. As you will see, besides removing from the processor
the wires drawn from `x10` to the LEDS, this only requires some small modifications in the SOC.

Before starting to modify the SOC, the first thing to do is to decide
about the "memory map", that is, which address space portion
corresponds to what. In our system, we have 6 kB of RAM, so in
practice we could say that addresses between 0 and 2^13-1 (8 kB, let
us keep a power of two) correspond to RAM. I decided to use a larger
portion of address space for RAM (because we also have FPGAs that have
ampler quantities of BRAM), then the address space dedicated to RAM
will be between 0 and 2^22-1 (that is, 4 MB of RAM).

Then, I decided to say that if bit 22 is set in an address, then this address
corresponds to a device. Now we need to specify how to select among multiple
devices. A natural idea is to use bits 0 to 21 as a "device index", but doing
so is going to require multiple 22-bits wide comparators, and on our IceStick,
it will eat-up a significant portion of the removing LUTs. A better idea,
suggested (once again) by Matthias Koch (@mecrisp), is to use 1-hot encoding,
that is, data is routed to device number `n` if bit `n` is set in the address.
We will only consider "word addresses" (that is, ignore the two LSBs).
Doing that, we can only plug 20 different devices to our SOC, but it is still
much more than what we need. The advantage is that it dramatically simplifies
address decoding, in such a way that everything still fits in the IceStick.

To determine whether a memory request should be routed to the RAM or to the
devices, we insert the following circuitry into the SOC:
```verilog
   wire [31:0] RAM_rdata;
   wire [29:0] mem_wordaddr = mem_addr[31:2];
   wire isIO  = mem_addr[22];
   wire isRAM = !isIO;
   wire mem_wstrb = |mem_wmask;
```

The RAM is wired as follows:
```verilog
   Memory RAM(
      .clk(clk),
      .mem_addr(mem_addr),
      .mem_rdata(RAM_rdata),
      .mem_rstrb(isRAM & mem_rstrb),
      .mem_wdata(mem_wdata),
      .mem_wmask({4{isRAM}}&mem_wmask)
   );
```
(note the `isRAM` signal ANDed with the write mask)

Now we can add the logic to wire our LEDs. They are
declared as a `reg` in the SOC module interface:
```verilog
module SOC (
    input 	     CLK,
    input 	     RESET,
    output reg [4:0] LEDS,
    input 	     RXD,
    output 	     TXD
);
```

driven by a simple block:
```verilog
   localparam IO_LEDS_bit = 0;

   always @(posedge clk) begin
      if(isIO & mem_wstrb & mem_wordaddr[IO_LEDS_bit]) begin
	 LEDS <= mem_wdata;
      end
   end
```

Now we can write (yet another version of) our old good blinky:
```verilog
      LI(gp,32'h400000);
      LI(a0,0);
   Label(L1_);
      SW(a0,gp,4);
      CALL(LabelRef(wait_));
      ADDI(a0,a0,1);
      J(LabelRef(L1_));
```

First we load the base address of the IO page in `gp` (that is, `2^22`). To write
LEDs value, we store `a0` to word address 1 (that is address 4) in the IO page.
To make things easier when we'll have several devices (right after), let us write
some helper functions:

```verilog
   // Memory-mapped IO in IO page, 1-hot addressing in word address.
   localparam IO_LEDS_bit      = 0;  // W five leds

   // Converts an IO_xxx_bit constant into an offset in IO page.
   function [31:0] IO_BIT_TO_OFFSET;
      input [31:0] bit;
      begin
	 IO_BIT_TO_OFFSET = 1 << (bit + 2);
      end
   endfunction
```

Then we can write to the LEDs as follows:

```verilog
   SW(a0,gp,IO_BIT_TO_OFFSET(IO_LEDS_bit));
```

_OK, is it all what you have, still your stupid blinky after 17 (!) tutorial steps ?_

Sure, you are right man. Let us add an UART to allow our core to display stuff to a
virtual terminal. The IceStick (and many other FPGA boards) has a special chip
(FTDI2232H if you want to know), that
translates between the plain old RS232 serial protocol and USB. It is good news for
us, because RS232 is a simple protocol, much easier to implement than USB. In fact,
our core will communicate with the outside word through two pins (one for sending
data, called `TXD` and one for receiving data, called `RXD`), and the FTDI chip
converts to the USB protocol for you. Moreover, it is a good idea not reinventing
the wheel, and there are many existing implementation of UART
(Universal Asynchronous Receiver Transmitter, that implement the RS232 protocol)
in VERILOG. For our
purpose, for now we will only implement half of it (that is, the part that lets
our processor send data over it to display text in a terminal emulator).

Olof Kindren has written a [Tweet-size UART](https://twitter.com/OlofKindgren/status/1409634477135982598),
more legible version [here](https://gist.github.com/olofk/e91fba2572396f55525f8814f05fb33d).

Let us insert it into our SOC and connect it:

```verilog
   // Memory-mapped IO in IO page, 1-hot addressing in word address.
   localparam IO_LEDS_bit      = 0;  // W five leds
   localparam IO_UART_DAT_bit  = 1;  // W data to send (8 bits)
   localparam IO_UART_CNTL_bit = 2;  // R status. bit 9: busy sending
   ...

   wire uart_valid = isIO & mem_wstrb & mem_wordaddr[IO_UART_DAT_bit];
   wire uart_ready;

   corescore_emitter_uart #(
      .clk_freq_hz(`BOARD_FREQ*1000000),
      .baud_rate(115200)
   ) UART(
      .i_clk(clk),
      .i_rst(!resetn),
      .i_data(mem_wdata[7:0]),
      .i_valid(uart_valid),
      .o_ready(uart_ready),
      .o_uart_tx(TXD)
   );

   wire [31:0] IO_rdata =
	       mem_wordaddr[IO_UART_CNTL_bit] ? { 22'b0, !uart_ready, 9'b0}
	                                      : 32'b0;

   assign mem_rdata = isRAM ? RAM_rdata :
	                      IO_rdata ;

```

The UART is projected onto two different addresses in memory space. The first
one, that can be only written to, sends one character. The second one, that can
be only read from, indicates whether the UART is ready (bit 9 = 0) or busy
sending a character (bit 9 = 1).


Now our processor has more possibilities to communicate with the outside world
than the poor five LEDs we had before ! Let us implement a function to send
a character:

```verilog
   Label(putc_);
      // Send character to UART
      SW(a0,gp,IO_BIT_TO_OFFSET(IO_UART_DAT_bit));
      // Read UART status, and loop until bit 9 (busy sending)
      // is zero.
      LI(t0,1<<9);
   Label(putc_L0_);
      LW(t1,gp,IO_BIT_TO_OFFSET(IO_UART_CNTL_bit));
      AND(t1,t1,t0);
      BNEZ(t1,LabelRef(putc_L0_));
      RET();
```

It writes the character to the UART address projected in IO space, then loops while
the UART status indicates that it is busy sending a character.

**Try this** run [step17.v](step17.v) in simulation.

_Wait a minute_ in simulation, how does it know how to display something ?

It's because I cheated a bit, I added the following block of code to the SOC:
```verilog
`ifdef BENCH
   always @(posedge clk) begin
      if(uart_valid) begin
	 $write("%c", mem_wdata[7:0] );
	 $fflush(32'h8000_0001);
      end
   end
`endif
```
(the magic constant argument to`$fflush()` corresponds to `stdout`, you need to
do that else you do not see anything on the terminal until the output buffer
of `stdout` is full). Doing so we do not test the UART in simulation (it is completely bypassed).
I trust Olof that it works fine, but to do things properly, it would be better to plug something
on the simulated `TXD` signal, decode the RS232 protocol and display the characters (we'll see
examples of this type of simulation later on).

**Try this** run [step17.v](step17.v) on device.

To display what's sent to the UART, use:
```
  $ ./terminal.sh
```
_Note_ edit `terminal.sh` and chose your favourite terminal emulator in there. You may also
need to change `DEVICE=/dev/ttyUSB1` according to your local configuration.

## Step 18: Computing the Mandelbrot set

Now that we have a functional RISC-V processor and a SOC with an UART that can send characters
to a virtual terminal, let us rest a little bit with a purely software step. In this step, we
are going to write a program in RISC-V assembly that computes a crude, ASCII-art version of
the Mandelbrot set.

Our "image" will be made of 80x80 characters. So let us start by writing a program that fills
the image with "*" characters. To do that, we will use two nested loops. The Y coordinate
will be stored in `s0` and the X coordinate in `s1`. The upper bound (80) will be stored
in `s11`. The program looks like that:

```verilog
      LI(gp,32'h400000); // IO page
      LI(s1,0);
      LI(s11,80);

   Label(loop_y_);
      LI(s0,0);

   Label(loop_x_);
      LI(a0,"*");
      CALL(LabelRef(putc_));

      ADDI(s0,s0,1);
      BNE(s0,s11,LabelRef(loop_x_));

      LI(a0,13);
      CALL(LabelRef(putc_));
      LI(a0,10);
      CALL(LabelRef(putc_));

      ADDI(s1,s1,1);
      BNE(s1,s11,LabelRef(loop_y_));

      EBREAK();
```
(and we copy the `putc` function from the previous example).

**Fixed point** So now we want to compute the Mandelbrot set. To do that, we need to manipulate real numbers.
Unfortunately, our super simplistic RISC-V core is not able to directly manipulate floating point
numbers. The C compiler's support library `libgcc` has some functions to support them, but we will
see later how to use them. For now, the idea is to compute the Mandelbrot set using fixed-point
numbers, that is, in an integer number, we will use some bits to represent the fractional part
(10 bits in our case), and some bits to represent the integer parts (22 bits in our case). In other
words, it means that if we want to represent a real number `x`, we will store (the integer part of)
`x*2^10` in a register. It is similar to floating point numbers, except that the exponent in our
case is always 10. We will use the following constants in our program:

```verilog
   `define mandel_shift 10
   `define mandel_mul (1 << `mandel_shift)
```

Now, to compute the sum or the difference of two numbers, it does not change anything, because
the `2^10` factor is the same for both numbers to be added (or subtracted). For a product it
is a different story, because when you compute `x*y`, the actual computation that you do is
`x*2^10*y*2^10`, so what you get is `(x*y)*2^20`, and you wanted `(x*y)*2^10`, so you need to
divide by `2^10` (right shift by `10`). OK, that's good, but how do we compute the product
of two integer numbers stored in two registers ? Our processor has no `MUL` instruction ? In fact
it is possible to add a `MUL` instruction (it is part of the RV32M instruction set), we will see
that later, but it will not fit within our tiny IceStick ! So what can we do ? We can implement
a function that takes two numbers in `a0` and `a1`, computes their products and returns it in `a0`.
The C compiler support library `libgcc` has one (it is what is used when compiling C for small
RV32I RISC-V processors that do not have the `MUL` instruction, like ours). The source-code of
this function is [here](https://github.com/riscv-collab/riscv-gcc/blob/5964b5cd72721186ea2195a7be8d40cfe6554023/libgcc/config/riscv/muldi3.S).
Let us port it to our VERILOG RISC-V assembler (that has a slightly different syntax unfortunately,
we will see later how to directly use gcc and gas):

```verilog
      // Mutiplication routine,
      // Input in a0 and a1
      // Result in a0
   Label(mulsi3_);
      MV(a2,a0);
      LI(a0,0);
   Label(mulsi3_L0_);
      ANDI(a3,a1,1);
      BEQZ(a3,LabelRef(mulsi3_L1_));
      ADD(a0,a0,a2);
   Label(mulsi3_L1_);
      SRLI(a1,a1,1);
      SLLI(a2,a2,1);
      BNEZ(a1,LabelRef(mulsi3_L0_));
      RET();
```
(do not forget to declare the new labels before the `initial` block).

So now, before displaying the Mandelbrot set, to test our fixed-point
computation idea, let us display a simpler shape, that is, we consider
we are visualizing the `[-2.0,2.0]x[-2.0,2.0]` square (mapped to our
30x30 characters display), and we want to display a disk of radius `2`
centered on `(0,0)`. To do that, we need first to compute the (fixed point)
coordinates `x,y`. They will be stored in `s2` and `s3`. Then we need to
compute `x^2+y^2`. We can do that by invoking the `mulsi3` routine twice
(do not forget to rightshift the result by 10). Finally, we compare
the result with `4 << 10` (4 because it is the _squared_ radius, and shifted
to the left by 10 because of our fixed-point representation), to decide
whether the point was inside or outside the disk, and use a different character
to display it. The corresponding program looks like that:

```verilog
   `define mandel_shift 10
   `define mandel_mul (1 << `mandel_shift)
   `define xmin (-2*`mandel_mul)
   `define xmax ( 2*`mandel_mul)
   `define ymin (-2*`mandel_mul)
   `define ymax ( 2*`mandel_mul)
   `define dx ((`xmax-`xmin)/30)
   `define dy ((`ymax-`ymin)/30)
   `define norm_max (4 << `mandel_shift)

   integer    loop_y_      = 28;
   integer    loop_x_      = 36;
   integer    in_disk_     = 92;

   initial begin
      LI(gp,32'h400000); // IO page

      LI(s1,0);
      LI(s3,`xmin);
      LI(s11,30);
      LI(s10,`norm_max);

   Label(loop_y_);
      LI(s0,0);
      LI(s2,`ymin);

   Label(loop_x_);

      MV(a0,s2);
      MV(a1,s2);
      CALL(LabelRef(mulsi3_));
      SRLI(s4,a0,`mandel_shift); // s4 = x*x
      MV(a0,s3);
      MV(a1,s3);
      CALL(LabelRef(mulsi3_));
      SRLI(s5,a0,`mandel_shift); // s5 = y*y
      ADD(s6,s4,s5);             // s6 = x*x+y*y
      LI(a0,"*");
      BLT(s6,s10,LabelRef(in_disk_)); // if x*x+y*y < 4
      LI(a0," ");
  Label(in_disk_);
      CALL(LabelRef(putc_));

      ADDI(s0,s0,1);
      ADDI(s2,s2,`dx);
      BNE(s0,s11,LabelRef(loop_x_));

      LI(a0,13);
      CALL(LabelRef(putc_));
      LI(a0,10);
      CALL(LabelRef(putc_));

      ADDI(s1,s1,1);
      ADDI(s3,s3,`dy);
      BNE(s1,s11,LabelRef(loop_y_));

      EBREAK();
```

and the output looks like that:
```
          ***********
        ***************
       ******************
     *********************
    ***********************
    ************************
   *************************
  ***************************
  ***************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
 *****************************
  ***************************
  ***************************
   *************************
   *************************
    ***********************
     *********************
      *******************
        ***************
          ***********
```

Now to compute the Mandelbrot set, we need to iterate the following operation:
```
   Z <- 0; iter <- 0
   do
      Z <- Z^2 + C
      iter <- iter + 1
   while |Z| < 2
```
where `Z` and `C` are complex numbers. `C = x + iy` corresponds to the current pixel.
Remember the rule for complex number multiplication (`i*i = -1`), we can compute
`Z^2 = (Zr + i*Zi)^2 = Zr^2-Zi^2 + 2*i*Zr*Zi`. The loop that computes these iterates
writes:
```verilog
   Label(loop_Z_);
      MV(a0,s4); // Zrr  <- (Zr*Zr) >> mandel_shift
      MV(a1,s4);
      CALL(LabelRef(mulsi3_));
      SRLI(s6,a0,`mandel_shift);
      MV(a0,s4); // Zri <- (Zr*Zi) >> (mandel_shift-1)
      MV(a1,s5);
      CALL(LabelRef(mulsi3_));
      SRAI(s7,a0,`mandel_shift-1);
      MV(a0,s5); // Zii <- (Zi*Zi) >> (mandel_shift)
      MV(a1,s5);
      CALL(LabelRef(mulsi3_));
      SRLI(s8,a0,`mandel_shift);
      SUB(s4,s6,s8); // Zr <- Zrr - Zii + Cr
      ADD(s4,s4,s2);
      ADD(s5,s7,s3); // Zi <- 2Zri + Cr

      ADD(s6,s6,s8); // if norm > norm max, exit loop
      LI(s7,`norm_max);
      BGT(s6,s7,LabelRef(exit_Z_));

      ADDI(s10,s10,-1);  // iter--, loop if non-zero
      BNEZ(s10,LabelRef(loop_Z_));

   Label(exit_Z_);
```

in the end, we display different characters depending on the value
of `iter` (`s10`) when the loop is exited:
```
   Label(exit_Z_);
      LI(a0,colormap_);
      ADD(a0,a0,s10);
      LBU(a0,a0,0);
      CALL(LabelRef(putc_));
```
where the "colormap" is an array of characters that mimic
different "intensities", from the darkest to the brightest:
```
   Label(colormap_);
      DATAB(" ",".",",",":");
      DATAB(";","o","x","%");
      DATAB("#","@", 0 , 0 );
```

![](https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/TUTORIALS/Images/mandelbrot_terminal.gif)

**Try that** run [step18.v](step18.v) in simulation and on the device. Modify it to draw your own graphics (for instance,
try drawing "concentric circles" using the "colormap").

## Step 19: Faster simulation with Verilator

As you have seen in Step 18, simulation is much much slower than running the design on the device. However, there is
another tool, called `verilator`, that lets you convert a VERILOG design into C++. Then you compile the C++, and you
have a simulation that is much much faster than icarus/iverilog. Let us first install verilator:
```
  $ apt-get install verilator
```

Before transforming our design into C++, we will have to create a "bench", that is, some C++ code that will generate the
signals for our design, and that will declare the C++ `main()` function. The main role of the main function is to declare
an object of class `VSOC` (generated from our `SOC` module), and wiggle its `CLK` signal. Each time the `CLK` signal is
changed, you need to call the `eval()` function to take the change into account. The `sim_main.cpp` file is as follows:

```c++
#include "VSOC.h"
#include "verilated.h"
#include <iostream>

int main(int argc, char** argv, char** env) {
   VSOC top;
   top.CLK = 0;
   while(!Verilated::gotFinish()) {
      top.CLK = !top.CLK;
      top.eval();
   }
   return 0;
}
```

In addition, in [sim_main.cpp](sim_main.cpp), there is some code to decode whenever the LEDs change, and display their
status.

To convert a design to C++, use the following command:
```
  $ verilator -DBENCH -DBOARD_FREQ=12 -Wno-fatal --top-module SOC -cc -exe sim_main.cpp step18.v
```

Then to compile the C++ and run the generated program:
```
  $ cd obj_dir
  $ make -f VSOC.mk
  $ ./VSOC
```

As you can see, it is much much faster than icarus/iverilog ! For a small design, it does not make a huge difference, but believe me, when you
are developping an RV32IMFC core, with a FPU, it is good to have efficient simulation !

To make things easier, there is a `run_verilator.sh` script, that you can invoke as follows:
```
  $ run_verilator.sh step18.v
```

## Step 20: Using the GNU toolchain to compile programs - assembly

At this step, you may have the feeling that our RISC-V design
is just a toy, for educational purpose, far away from "the real thing".
In fact, at this step, you will start feeling that what you have done
is as real as any other RISC-V processor ! What makes a processor interesting
is the software you can run on it, hence if our thingy can run any software
written for a (RV32I) RISC-V processor, then it is a RV32I RISC-V processor.

_Wait a minute_ but what we have used up to now to write the software is
the VERILOG assembler, it is just a toy, different from the real thing no ?

In fact, the VERILOG assembler generates exactly the same machine code as
any other RISC-V assembler. We coud use instead any other RISC-V assembler,
load the generated machine code into our design and run it !

To do so, VERILOG has a `$readmemh()` command, that loads the data to
initialize a memory from an external file. It is used as follows in
[step20.v](step20.v):

```verilog
   initial begin
       $readmemh("firmware.hex",MEM);
   end
```

where `firmware.hex` is an ASCII file with the initial content of `MEM` in hexadecimal.

So if we want to use an external assembler, all we have to do is figure out the following
things:
- how to compile RISC-V assembly code using GNU tools
- how to tell GNU tools about the device we have created (RAM start address, RAM amount)
- how to convert the output of GNU tools into a file that `$readmemh()` can understand

OK, let us start with a simple blinker, in [blinker.S](FIRMWARE/blinker.S):

```
# Simple blinker

.equ IO_BASE, 0x400000
.equ IO_LEDS, 4

.section .text

.globl start

start:
        li   gp,IO_BASE
	li   sp,0x1800
.L0:
	li   t0, 5
	sw   t0, IO_LEDS(gp)
	call wait
	li   t0, 10
	sw   t0, IO_LEDS(gp)
	call wait
	j .L0

wait:
        li t0,1
	slli t0, t0, 17
.L1:
        addi t0,t0,-1
	bnez t0, .L1
	ret
```

As you can see, it is very similar to the code we wrote up to now in the
VERILOG assembler. In this program, we have three different things:
- **main program**
- **utilities**, here the `wait` function
- **setup**, that is, initializing `gp` and `sp`

So we will split the file into three parts:
- [FIRMWARE/blinker.S](FIRMWARE/blinker.S) with the `main` function
- [FIRMWARE/wait.S](FIRMWARE/wait.S) with the `wait` function
- [FIRMWARE/start.S](FIRMWARE/start.S) with the setup code, that calls `main` in the end.

To compile it, you will need to install the RISC-V toolchain (compiler,
assembler, linker) on your machine. Our makefile can do that for you:

```
  $ cd learn-fpga/FemtoRV
  $ make ICESTICK.firmware_config
```
_Note:_ always use `ICESTICK.firmware_config`, even if you have a larger board,
it will configure the makefiles for `RV32I` build (and that's what our processor
supports).

This will download some files and unpack them in `learn-fpga/FemtoRV/FIRMWARE/TOOLCHAIN`.
Add the `riscv64-unknown-elf-gcc..../bin/` directory to your path.

Now to compile our program:
```
  $ cd learn-fpga/FemtoRV/TUTORIALS/FROM_BLINKER_TO_RISCV/FIRMWARE
  $ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax start.S -o start.o
  $ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax blinker.S -o blinker.o
  $ riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -mno-relax wait.S -o wait.o
```
We specify the architecture (`rv32i`) that corresponds to the instructions
supported by our processor and the ABI (`ilp32`) that corresponds to the way functions
are called. THe `no-relax` option concerns the `gp` register that we use for
accessing the IO page (so we do not let the assembler use it for anything else).

This generates object files (`.o`). We now need to generate an executable from them,
by invoking the linker. The linker will determine where our code and data should
be implanted in memory. For that, we need to specify how the memory in our
device is organized, in a linker script ([FIRMWARE/bram.ld](FIRMWARE/bram.ld)):

```
MEMORY
{
   BRAM (RWX) : ORIGIN = 0x0000, LENGTH = 0x1800  /* 6kB RAM */
}
SECTIONS
{
    everything :
    {
	. = ALIGN(4);
	start.o (.text)
        *(.*)
    } >BRAM
}
```

A linker script contains a description of `MEMORY`. In our case, there is a single
segment of 6 kB of memory, that we call `BRAM`. It starts from address `0x0000`.
Then we have `SECTIONS`, that indicates what goes where (or which segment goes
to which memory). In our case, it is super simple: everything goes to BRAM.
We also indicate that the content of `start.o` should be installed first in memory.
The linker is invoked as follows:

```
  $ riscv64-unknown-elf-ld blinker.o wait.o -o blinker.bram.elf -T bram.ld -m elf32lriscv -nostdlib -norelax
```

It generates an "elf" executable ("elf" stands for Executable and Linkable Format). It is the
same format as the binaries in a Linux system. The option
`-T bram.ld` tells it to use our linker script. The option `-m elf32lriscv` indicates that
we are generating a 32-bits executable. We are not using the C stdlib for now (`-nostdlib`) and
we keep `gp` for ourselves (`-norelax`). We do not need to have `start.o` on the command line
in the list of objects to link, because it is already included in the linker script `bram.ld`.

We are not completely done, now we need to extract the relevant information from the elf executable,
and generate a file with all the machine code in hexadecimal, so that VERILOG's `$readmemh()` function
can understand it. For that, I wrote a `firmware_words` utility, that understands the elf file formats,
extracts the parts that are interesting for us and writes them in ASCII hexadecimal:

```
  $ make blinker.bram.hex
```

_Note_ you can invoke `make xxxx.bram.hex` directly, it will invoke the assembler, linker and
elf conversion utility for you automatically.


Now you can run the example in simulation and on the device:
```
  $ cd ..
  $ ./run_verilator.sh step20.v
  $ BOARDS/run_xxx.sh step20.v
```

Now that things are easier, we can write more complicated programs. Let us see how
to write the famous "hello world" program. What we need is a `putstring` routine to display
a string on the tty. It takes as input the address of the first character of the string
to display in `a0`.  We just need to loop on all characters of the string, and
exit the loop as soon as we find a null character, and call `putchar` for each character:
```
# Warning, buggy code ahead !
putstring:
	mv t2,a0
.L2:    lbu a0,0(t2)
	beqz a0,.L3
	call putchar
	addi t2,t2,1
	j .L2
.L3:    ret
```
Have you seen the comment ? It means the code above has an error, can you spot it ?

A hint, `putstring` is a function that calls a function. Don't we need to do special
in this case ?

Do you remember what `call` and `ret` do ? Yes, `call` stores `PC+4` in `ra` then
jumps to the function, and `ret` jumps to the address in `ra`. Now suppose that
somebody called our `putstring` function. When we enter the function, `ra` contains
the address we are supposed to jump to when reaching the `ret` statement in `putstring`.
But inside `putstring`, we call `putchar`, and it overwrites `ra` with the address right
after the call, so that `putchar` will be able to jump there when it will return, but
`putstring` will jump there as well, which is not what we want. To avoid that, we need
to save `ra` at the beginning of `putstring`, and restore it at the end. To do that,
we use the stack as follows:

```
putstring:
	addi sp,sp,-4 # save ra on the stack
	sw ra,0(sp)   # (need to do that for functions that call functions)
	mv t2,a0
.L2:    lbu a0,0(t2)
	beqz a0,.L3
	call putchar
	addi t2,t2,1
	j .L2
.L3:    lw ra,0(sp)  # restore ra
	addi sp,sp,4 # resptore sp
	ret
```

The function can be used as follows:
```
   la   a0, hello
   call putstring

   ...

hello:
	.asciz "Hello, world !\n"
```

The `la` (load address) pseudo-instruction loads the address of the string
in `a0`. The string is declared with a standard label, and the `.asciz`
directive that generates a zero-terminated string.

**Try this** Compile `hello.S` (`cd FIRMWARE; make hello.bram.hex`) and test it in simulation and on device.
Try also `mandelbrot.S`. As you can see, [FIRMWARE/mandelbrot.S](FIRMWARE/mandelbrot.S) does not have
the `__mulsi` function. If you take a look at [FIRMWARE/Makefile](FIRMWARE/Makefile), the executable is
linked with the right version of `libgcc.a` (for RV32I), that has it.

Now you can start having a feeling that your processor is a real thing: when you run the Mandelbrot example,
it executes code on *your* processor that was written by *somebody else*. Can we go further and run code
generated by standard tools ?

## Step 21: Using the GNU toolchain to compile programs - C

Let us see now how we can write code in C for our processor. At this point, we are able to
generate object files (`.o`) and produce an elf executable from them using the linker. Our
linker script ensures that everything goes at the right place in memory, then our processor
can execute the code, first the content of `start.S`, implanted at address 0, that calls in
turn the `main` function. Up to now our programs were completely written in assembly. The
nice thing with the ABI (Application Binary Interface), that we have seen at steps 13 and 14,
is that it makes it possible to combine object files (`.o`) produced by different tools, as
soon as they respect the ABI, which is the case (of course) of the C compiler.

The example [FIRMWARE/sieve.c](FIRMWARE/sieve.c), taken from the examples in picorv is a good
candidate. It is interesting, it does multiplications, divisions and modulos using integer
numbers. These operations are not implemented by our RV32I core, but they are supported
by the compiler using functions in `libgcc.a`, and since we link with `libgcc.a`, this will
work. However, the program also uses `printf()` to display the result, and this function
is declared in `libc.a`. In principle, it would be possible to use it, but `printf()`
supports so many formats that its code is too large and will not fit in our 6 kB or RAM.
For this reason, we include a much smaller / much simpler version in
[FIRMWARE/print.c](FIRMWARE/print.c) (also taken from picorv), and included in the objects
to be linked with executables.

![](mandel_and_riscvlogo.png)

There are two other examples, a C version of the Mandelbrot program:
[FIRMWARE/mandel_C.c](FIRMWARE/mandel_C.c). It uses
[ANSI colors](https://stackoverflow.com/questions/4842424/list-of-ansi-color-escape-sequences) to display
low-resolution "graphics" in the terminal. There is also [FIRMWARE/riscv_logo.c](FIRMWARE/riscv_logo.c)
that displays a spinning Risc-V logo (in a 90-ish demoscene style !).

**Try this** Compile `sieve.c` (`cd FIRMWARE; make sieve.bram.hex`) and test it in simulation (`./run_verilator.sh step20.v`)
and on device (`BOARDS/run_xxx.sh step20.v; ./terminal.sh`).
Try the other programs. Write your own programs (if you do not have an idea, try for instance cellular automata, Life ...).
Note: the Verilator framework can directly load ELF executables in simulation (no need to regenerate `firmware.hex`). You can generate all
demo programs: `cd FIRMWARE; make hello.bram.elf mandelbrot.bram.elf mandel_C.bram.elf riscv_logo.bram.elf;cd ..`, then run the one that you want
using `./run_verilator.sh step20.v FIRMWARE/mandel_C.bram.elf` or `./obj_dir/FIRMWARE/mandel_C.bram.elf`.

Now you can see that your processor is not just a toy, it is a real
RISC-V processor on which you can run programs produced by standard
tools !

_Note_ on the IceStick, we only have `6kB` of RAM, so only tiny programs will fit. If the compiled
program is larger than `6kB` then you will get an error. A more problematic case is a program that
nearly fills the whole BRAM, then we have nearly no space for the stack, and the stack will overwrite
the rest, putting the CPU in an invalid state, probably frozen. This situation is difficult to understand /
to debug when you encounter it, so `firmware_words` displays a big warning message whenever the generated
code fills more than 95% of the BRAM.

## Step 22: Storing data: can I have more than 6 kB of memory ?

_and some optimizations in the processor_

![](IceStick_SPIFLASH.jpg)

On the IceStick, there are only 8 blocks of 1 kB of BRAM, and since we
need to use two of them for the registers, this leaves only 6 kB of
RAM for our programs. It is sufficient for small programs like
Mandelbrot or little graphic demos, but you will very soon reach the
limit. The IceStick has a little chip (see figure) with 4 MBs of FLASH
memory (other boards have a similar chip). When you synthesize a
design, it is stored in this FLASH memory. On startup, the FPGA loads
its configuration from this chip. The nice thing is that the FPGA
configuration takes no more than a few kilobytes, this leaves us a lot
of space to store our own data. But we will need to create some
additional hardware to communicate with this chip.

As you can see on the figure, this chip only has 8 legs, how can we
address 4 MBs of data using 8 pins only ?  In fact, this chip uses a
_serial protocol_ (SPI). To access data, one sends the address to be
read on a pin, one bit at a time, then the chip sends the data back on
another pin, one bit at a time. If you want to learn more about it,
my notes about SPI flash are
[here](https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/TUTORIALS/spi_flash.md)
and the VERILOG implementation is in [spi_flash.v](spi_flash.v).
It supports different protocols, depending on the used number of pins and whether pins are bidirectional.

The `MappedSPIFlash` module has the following interface:
```verilog
module MappedSPIFlash(
    input wire 	       clk,
    input wire 	       rstrb,
    input wire [19:0]  word_address,

    output wire [31:0] rdata,
    output wire        rbusy,

    output wire        CLK,
    output reg         CS_N,
    inout  wire [1:0]  IO
);
```

| signal       | description                                                    |
|--------------|----------------------------------------------------------------|
| clk          | system clock                                                   |
| rstrb        | read strobe, goes high whenever processor wants to read a word |
| word_address | address of the word to be read                                 |
| rdata        | data read from memory                                          |
| rbusy        | asserted if busy receiving data                                |
| CLK          | clock pin of the SPI flash chip                                |
| CS_N         | chip select pin of the SPI flash chip, active low              |
| IO           | two bidirectional pins for sending and receiving data          |

Now the idea is to modify our SOC in such a way that some addresses correspond to the SPI flash.
First we need to decide how it will be projected into the memory space of our processor. The
idea is to use bit 23 of memory addresses to select the SPI Flash. Bit 22 is for IO (LEDs, UART). In
addition, for IO, we need to check that bit 23 is zero. And if both bits 23 and 22 are zero, then we
are in BRAM. So our memory space is decomposed into four "quadrants" depending on bits 23 and 22, and
we use three of them.

Then we have the different
signals to discriminate the different zones of our memory:
```verilog
   wire isSPIFlash  = mem_addr[23];
   wire isIO        = mem_addr[23:22] == 2'b01;
   wire isRAM       = mem_addr[23:22] == 2'b00;
```

The `MappedSPIFlash` module is wired as follows:
```verilog
   wire SPIFlash_rdata;
   wire SPIFlash_rbusy;
   MappedSPIFlash SPIFlash(
      .clk(clk),
      .word_address(mem_wordaddr),
      .rdata(SPIFlash_rdata),
      .rstrb(isSPIFlash & mem_rstrb),
      .rbusy(SPIFlash_rbusy),
      .CLK(SPIFLASH_CLK),
      .CS_N(SPIFLASH_CS_N),
      .IO(SPIFLASH_IO)
   );
```
(the pins `SPIFLASH_CLK`, `SPIFLASH_CS_N`, `SPIFLASH_IO[0]` and `SPIFLASH_IO[1]` are declared
in the constraint file, in the `BOARDS` subdirectory).

The data sent to the processor has a three-ways mux:
```verilog
   assign mem_rdata = isRAM      ? RAM_rdata :
                      isSPIFlash ? SPIFlash_rdata :
	                           IO_rdata ;
```

OK, now our processor can automatically trigger a SPI flash read by accessing memory with bit 23 set in the
address, but how does it know that data is ready ? (remember, data arrives one bit at a time). There is
this `SPIFlash_rbusy` that goes high whenever `MappedSPIFlash` is busy receiving some data, we need to take it
into account in our processor's state machine. We add a new input signal `mem_rbusy` to our processor,
and modify the state machine as follows:
```verilog
   ...
   WAIT_DATA: begin
      if(!mem_rbusy) begin
	 state <= FETCH_INSTR;
      end
   end
   ...
```

Then, in the SOC, this signal is wired to `SPIFlash_rbusy`:
```verilog
   wire mem_rbusy;
   ...
   Processor CPU(
     ...
     .mem_rbusy(mem_rbusy),
     ...
   );
   ...
   assign mem_rbusy = SPIFlash_rbusy;
```

By the way, since we are revisiting the state machine, there is something
we can do. Remember this portion of the state machine, don't you think
we could go faster ?
```verilog
   WAIT_INSTR: begin
      instr <= mem_rdata;
      state <= FETCH_REGS;
   end
   FETCH_REGS: begin
      rs1 <= RegisterBank[rs1Id];
      rs2 <= RegisterBank[rs2Id];
      state <= EXECUTE;
   end
```

Yes, `rs1Id` and `rs2Id` are simply 5 wires (each) drawn from `instr`, so we can
get them from `mem_rdata` directly, and fetch the registers in the `WAIT_INSTR` state,
as follows:
```verilog
   WAIT_INSTR: begin
      instr <= mem_rdata;
      rs1 <= RegisterBank[mem_rdata[19:15]];
      rs2 <= RegisterBank[mem_rdata[24:20]];
      state <= EXECUTE;
   end
```
Doing so we gain one cycle per instruction, and it is an easy win !

Oh, and one more thing, why do we need a `LOAD` and a `STORE` state, could'nt we
initiate memory transfers in the `EXECUTE` state ? Yes we can, so we need to change the write mask and
read strobes accordingly, like that:
```verilog
   assign mem_rstrb = (state == FETCH_INSTR || (state == EXECUTE & isLoad));
   assign mem_wmask = {4{(state == EXECUTE) & isStore}} & STORE_wmask;
```

Then the state machine has 4 states only !
```verilog
   localparam FETCH_INSTR = 0;
   localparam WAIT_INSTR  = 1;
   localparam EXECUTE     = 2;
   localparam WAIT_DATA   = 3;
   reg [1:0] state = FETCH_INSTR;
   always @(posedge clk) begin
      if(!resetn) begin
	 PC    <= 0;
	 state <= FETCH_INSTR;
      end else begin
	 if(writeBackEn && rdId != 0) begin
	    RegisterBank[rdId] <= writeBackData;
	 end
	 case(state)
	   FETCH_INSTR: begin
	      state <= WAIT_INSTR;
	   end
	   WAIT_INSTR: begin
	      instr <= mem_rdata;
	      rs1 <= RegisterBank[mem_rdata[19:15]];
	      rs2 <= RegisterBank[mem_rdata[24:20]];
	      state <= EXECUTE;
	   end
	   EXECUTE: begin
	      if(!isSYSTEM) begin
		 PC <= nextPC;
	      end
	      state <= isLoad  ? WAIT_DATA : FETCH_INSTR;
	   end
	   WAIT_DATA: begin
	      if(!mem_rbusy) begin
		 state <= FETCH_INSTR;
	      end
	   end
	 endcase
      end
   end
```

There are several other things that we can optimize. First thing, you may have noticed that
the two LSBs of the instructions are always `2'b11` in RV32I, so we do not need to load them:
```verilog
   reg [31:2] instr;
   ...
   instr <= mem_rdata[31:2];
   ...
   wire isALUreg  =  (instr[6:2] == 5'b01100);
   ...
```

Something else: we are doing all address computations with 32 bits, whereas our address space
has 24 bits only, we can save significant resources there:
```verilog
   localparam ADDR_WIDTH=24;
   wire [ADDR_WIDTH-1:0] PCplusImm = PC + ( instr[3] ? Jimm[31:0] :
				  instr[4] ? Uimm[31:0] :
				             Bimm[31:0] );
   wire [ADDR_WIDTH-1:0] PCplus4 = PC+4;

   wire [ADDR_WIDTH-1:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm   :
	                                  isJALR   ? {aluPlus[31:1],1'b0} :
	                                             PCplus4;

   wire [ADDR_WIDTH-1:0] loadstore_addr = rs1 + (isStore ? Simm : Iimm);
```

The up to date verilog file is avalaible in [step22.v](step22.v). Let us now check
that we are able to access the SPI flash from our processor, with the following
[program](FIRMWARE/read_spiflash.c):
```C
#include "io.h"
#define SPI_FLASH_BASE ((char*)(1 << 23))
int main()  {
   for(int i=0; i<16; ++i) {
      IO_OUT(IO_LEDS,i);
      int lo = (int)SPI_FLASH_BASE[2*i  ];
      int hi = (int)SPI_FLASH_BASE[2*i+1];
      print_hex_digits((hi << 8) | lo,4); // print four hexadecimal digits
      printf(" ");
   }
   printf("\n");
}
```

The SPI flash is mapped in memory space, using addresses with bit 23 set (the
first address, that we call `SPI_FLASH_BASE`, is `1 << 23`). Then we access all individual
bytes, and display them by grouping them into 16-bit words (for each word, the first byte
in memory is the least significant one, because RISC-V follows the little-endian convention).
We have a `print_hex_digits()` function in [FIRMWARE/print.c](FIRMWARE/print.c) that does the job
(the second argument is the number of hex characters we want to print for each number).

Now compile the program, synthesize the design and send it to the device as follows:

```
 $ cd FIRMWARE
 $ make read_spiflash.bram.hex
 $ cd ..
 $ BOARDS/run_icestick.sh step22.v
 $ ./terminal.sh
```

... and you see nothing. While is this so ? The program finished before you started the terminal,
so we were not able to see anything, but you can reset the processor, pushing the invisible reset
button (mentioned in [step 2](README.md#step-2-slower-blinky)). Each time you push the
"button", it will display on the terminal the first 16 words stored in the SPI flash.
On a IceStick, you will see something like:
```
00FF FF00 AA7E 7E99 0051 0501 0092 6220 4B01 0072 8290 0000 0011 0101 0000 0000
```

Do you have an idea where these values come from ? Remember why there is this SPI flash chip on your FPGA
board: it is where your design is stored. When the FPGA starts, it loads its design from the SPI flash. The
design corresponds to the file `SOC.bin`, that is generated at the end of the `yosys/nextpnr/icepack` pipeline:
- `yosys` transforms your verilog into a "circuit", also called a "netlist"
- then `nextpnr` maps the gates  of this circuit to the logical elements of the FPGA,
- and finally `icepack` converts the result into a "binary stream" directly understood by the FPGA.

Let us examine the 16 first words of the binary stream:

```
  $ od -x -N 32 SOC.bin
```

Then you'll see something like:
```
0000000 00ff ff00 aa7e 7e99 0051 0501 0092 6220
0000020 4b01 0072 8290 0000 0011 0101 0000 0000
0000040
```

and this corresponds to what we have just seen on the terminal, read from the SPI flash chip.
So our CPU can read its own FPGA representation from the SPI flash, like a biologist sequencing his
hown DNA ! While it has a nice and intriguing recursion flavor, it is probably of very little practical
use, but let us take a deeper look at it: the `SOC.bin` file is not very large:

```
$ ls -al SOC.bin
-rw-rw-r-- 1 blevy blevy 32220 Jan  7 07:31 SOC.bin
```

It weights only `32KB` or so, and our SPI flash chip has capacity for `4MB`, so there is plenty of room for us !
The only thing we need to take care of is not overwriting the FPGA configuration (in other words, always start further
away then the size of `SOC.bin`). So we will use a `1MB` offset for storing our data (you will say we are wasting a lot
of space between `32KB` and `1MB` but we shall use that space for something else in subsequent steps of this tutorial).

**Try this** Create a text file `hello.txt`, send it to the FPGA at the `1MB` offset (see below how to do that), write
a program that displays the stored file. To know where to stop, you may need either to decide for a termination character
or to precode the length of the file.

For ICE40 boards (IceStick, IceBreaker, ...), use:
```
 $ iceprog -o 1M hello.txt
```

For ECP5 boards (ULX3S), use:
```
 $ cp hello.txt hello.img
 $ ujprog -j flash -f 1048576 hello.img
```
(using latest version of `ujprog` compiled from [https://github.com/kost/fujprog](https://github.com/kost/fujprog)).


![](ST_NICCC_tty.png)

OK, so now we are ready to use the new storage that we have for more interesting things.
What we will do is displaying
an animation on the terminal. The animation is a demo from the 90's, that streams
polygon data to a software polygon renderer. Polygon data is a 640 kB binary file,
available from `learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat` (see other
files in the same directory for more information about the file format). First
thing to do is writing the file to the SPI flash, from a 1MBytes offset. For
ICE40-based boards (IceStick, IceBreaker), use:

```
 $ iceprog -o 1M learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat
```

For ECP5 boards (ULX3S), use:
```
 $ cp learn_fpga/FemtoRV/FIRMWARE/EXAMPLES/DATA/scene1.dat scene1.img
 $ ujprog -j flash -f 1048576 scene1.img
```
(using latest version of `ujprog` compiled from [https://github.com/kost/fujprog](https://github.com/kost/fujprog)).

Now you can compile the program:
```
 $ cd FIRMWARE
 $ make ST_NICCC.bram.hex
 $ cd ..
```
and send the design and the program to the device:
```
 $ BOARDS/run_xxx.sh step22.v
 $ ./terminal.sh
```
**Try this** Store an image in SPI Flash (in a format that is easy to read), and write a program to display it.
You can use `printf("\033[48;2;%d;%d;%dm ",R,G,B);` to send a pixel (where `R`,`G`,`B` are numbers between 0 and 255),
and `printf("\033[48;2;0;0;0m\n");` after each scanline.

## Step 23: running programs from SPI Flash, first steps

With what we have done in the previous step, we are now able to load data from the SPI flash, and we
have ample space for all our data, but we still have only 6 kB that is shared between our code and
variables, it is not much ! It would be great to be able to use the SPI flash to store our code,
and execute it directly from there. We were able to write nice demos that fit in 6 kB, imagine what
you could do with 2 MB for code, and the entire 6 kB available for your variables !

To be able to load code from the SPI flash, the only thing we need to
change is staying in the `WAIT_INSTR` state until `mem_rbusy` is zero, hence we
just need to test `mem_rbusy` before changing `state` to `EXECUTE`:

```verilog
   WAIT_INSTR: begin
      instr <= mem_rdata[31:2];
      rs1 <= RegisterBank[mem_rdata[19:15]];
      rs2 <= RegisterBank[mem_rdata[24:20]];
      if(!mem_rbusy) begin
	 state <= EXECUTE;
      end
   end
```

and we initialize the BRAM with the following program, that jumps to address `0x00820000`:

```verilog
   initial begin
      LI(a0,32'h00820000);
      JR(a0);
   end
```

This address corresponds to the address where the SPI flash is projected into the address space of our
CPU (`0x00800000` = 1 << 23) plus an offset of 128kB (`0x20000`). This offset of 128 kB is
necessary because remember, we share the SPI Flash with the FPGA that stores its configuration
in it !

OK, that's mostly it for the hardware part. Let us see now if we can execute code from there.
To do that, we will need a new linker script ([FIRMWARE/spiflash0.ld](FIRMWARE/spiflash0.ld)):

```
MEMORY {
   FLASH (RX)  : ORIGIN = 0x00820000, LENGTH = 0x100000 /* 4 MB in flash */
}
SECTIONS {
    everything : {
	. = ALIGN(4);
	start.o (.text)
        *(.*)
    } >FLASH
}
```

It is the same thing as before, but we tell the linker to put everything in flash memory (for now,
we will see later how it works for global variables). Let us test it with a program that does not
write to global variables, for instance [FIRMWARE/hello.S](FIRMWARE/hello.S). To link it using our
new linker script, we do:
```
  $ riscv64-unknown-elf-ld -T spiflash0.ld -m elf32lriscv -nostdlib -norelax hello.o putchar.o -o hello.spiflash0.elf
```

But since it is tedious to type, it is automated by the Makefile:
```
  $ make hello.spiflash0.elf
```

Now you need to convert the ELF executable into a flat binary:
```
  $ riscv64-unknown-elf-objcopy hello.spiflash0.elf hello.spiflash0.bin -O binary
```

or with our Makefile:
```
  $ make hello.spiflash0.bin
```

and send it to the SPI flash at offset 128k:
```
  $ iceprog -o 128k hello.spiflash0.bin
```

or with our Makefile:
```
  $ make hello.spiflash0.prog
```

and then:
```
  $ ./terminal.sh
```

## Step 24: running programs from SPI Flash, a better linker script

Before starting, let us make a little change in our core: when pushing the reset
button, it jumps at address 0, which is initialized as a jump to flash memory, but
after executing our program, it is possible (and highly probable) that the RAM will
have been used for something else, and no longer has the jump-to-flash instruction.
To fix this, one can make the CPU jump to flash memory each time reset goes low:

```verilog
   if(!resetn) begin
     PC    <= 32'h00820000;
     state <= WAIT_DATA;
   end
```

Note that state is set to WAIT_DATA, so that it waits for `mem_rbusy` to go low before
doing anything else.

OK, so now we have a large quantity of flash memory in which we can install the code
and run it from there. We can also install readonly variables in there, like the
string `.asciz "Hello, world !\n"` in the previous example. And what about local
variables ? They are allocated on the stack, that resides in the 6 kB of RAM that
we have, so it will work. How does it know where the stack is ? Remember, we
have written [FIRMWARE/start.S](FIRMWARE/start.S), that initializes `sp` at the
end of the RAM (`0x1800`) and it suffices.

But how does it work for a program like that ?
```C
  int x = 3;
  void main() {
     x = x + 1;
     printf("%d\n",x);
  }
```

The global variable `x` has an initial value that needs to be stored somewhere,
so we need to put it in flash memory, but we are modifying it after,
so we need to put it in RAM, how is it possible ?
In fact, what we need is a mechanism for storing all the initial values of the
(initialized) global variables in flash memory and copy them to RAM on startup.
To do that, we will need a new linker script (that indicates where to put the variables and
where to put their initial values) and a new `start.S` (that copies the initial
values to the variables). Let us see how to do that.

When you compile C code, the compiler inserts directives to indicate where the different
things go (sections). To take a look, generate assembly from one of our C programs:
```
$ cd FIRMWARE
$ make ST_NICCC.o
$ readelf -S ST_NICCC.o
```

it will show you the different sections that are present in the object file.

| section     | description        |
|-------------|--------------------|
| text        | executable code    |
| bss, sbss   | uninitialized data |
| data, sdata | read-only data     |
| rodata      | read-only data     |

The section name (bss) for uninitialized data has an historic reason
that dates back to the 60's (BSS: Block Started by Symbol is a
pseudo-instruction of an assembler for the IBM 704). Uninitialized
and initialized data sections come in two flavor, sbss and sdata
is for small uninitialized (resp) initialized) data.

In `readelf` output, there is also a `type` field. `PROGBIT` means that
some data needs to be loaded from the file (for `text`, `data` and `rodata`)
segments. `NOBITS` means that no data should be loaded (for `bss`). Then the `Addr`
indicates where the section will be mapped into memory (for a `.o` file, it is always 0,
but it is useful for a linked elf executable, you can check using `readelf`). Then
the `Offs` field indicates the offset for the section's data in the `.o` file, and
the `Size` field the number of bytes in the section.

So what we have to do is writing a linker script that will say the following things:
- `text` sections go to the flash memory
- `bss` sections go to BRAM
- `data` sections go to BRAM, but have their initial values stored in the flash memory

For `text` and `bss`, we already know how to do it. For `data`, linker scripts can specify
a LMA (Load Memory Address), that indicates where initial values need to be stored. In our
linker script, we will have something like:

```
  MEMORY {
      FLASH (rx)  : ORIGIN = 0x00820000, LENGTH = 0x100000
      RAM   (rwx) : ORIGIN = 0x00000000, LENGTH = 0x1800
  }
  SECTIONS {

    .data: AT(address_in_spi_flash) {
      *(.data*)
      *(.sdata*)
    } > RAM

    .text : {
      start_spiflash1.o(.text)
      *(.text*)
      *(.rodata*)
      *(.srodata*)
    } >FLASH

    .bss : {
      *(.bss*)
      *(.sbss*)
    } >RAM
  }
```

Each section indicates how to map sections read from object files to sections in the executable
(`.data`, `.text` and `.bss`), and how to map these sections to the flash memory and to the BRAM.
For each section, some pattern matching rules indicate which sections from the object files are
concerned. For the `.text` section, we make sure that the first section is the
text section of `start_spiflash1.o`, because our processor jumps there on reset.
Note also that we put the readonly data (`.rodata` and `.srodata`) into the flash.

For the `.data` section, the `AT` keyword indicates the LMA (Load
Memory Address) where the linker will put the initial values (an
address in spi flash), and whenever a symbol in a `data` or `sdata`
section is referenced, the linker will use its address in RAM.

But a question remains: how does the system know that it should copy initialization data from
the flash into BRAM ? How does it know at which address ? How can we initialize uninitialized
data (BSS) to zero ? In fact we need to do it by hand,
in the startup code `start_spiflash1.S`, that looks like that:

```asm
.equ IO_BASE, 0x400000

.text
.global _start
.type _start, @function

_start:
.option push
.option norelax
     li  gp,IO_BASE
.option pop

     li   sp,0x1800

# zero-init bss section:
     la a0, _sbss
     la a1, _ebss
     bge a0, a1, end_init_bss
loop_init_bss:
     sw zero, 0(a0)
     addi a0, a0, 4
     blt a0, a1, loop_init_bss
end_init_bss:

# copy data section from SPI Flash to BRAM:
     la a0, _sidata
     la a1, _sdata
     la a2, _edata
     bge a1, a2, end_init_data
loop_init_data:
     lw a3, 0(a0)
     sw a3, 0(a1)
     addi a0, a0, 4
     addi a1, a1, 4
     blt a1, a2, loop_init_data
end_init_data:

     call main
     ebreak
```

- The first thing that we do is initializing the stack pointer and the general
  pointer `gp` (with the IO page address in our case).
- the first loop clears the memory between `_sbss` and `_ebss`.
- the second loop copies data from `_sidata` to `_sdata` ... `_edata`
- finally we call `main`

... but wait a minute, how do we know the values
for `_sbss`,`_ebss`,`_sidata`,`_sdata`,`_edata` ?

In fact, the linker script can generate them for us. Here is
what the `.data` section looks like:

```
    .data : AT ( _sidata ) {
        . = ALIGN(4);
        _sdata = .;
        *(.data*)
        *(.sdata*)
        . = ALIGN(4);
        _edata = .;
    } > RAM
```

where `.` denotes the current address. In addition, lines like `. = ALIGN(4);`
make sure that addresses remain aligned on 4-bytes boundaries, since our
initialization loops in `start_spiflash1.S` depend on that.

The declaration for the `.text` section looks like:

```
    .text : {
        . = ALIGN(4);
        start_spiflash1.o(.text)
        *(.text*)
        . = ALIGN(4);
        *(.rodata*)
        *(.srodata*)
        _etext = .;
        _sidata = _etext;
    } >FLASH
```

note that it declares `_sidata` right at the end of the text section, so that the `.data` section can
put its initialization data there.

OK, so let us try it with one of our examples:
```
  $ cd FIRMWARE
  $ make mandel_C.spiflash1.prog
  $ cd ..
  $ ./terminal.sh
```

Yes, it works, but _wait a minute_, it is significantly slower than before. Can you guess why ?

Remember that the FLASH memory is a *serial* memory, wich means that addresses are sent one bit
at a time and the result is obtained also one bit at a time (well, in fact two bits at a time
for both in our case), it is much slower than the BRAM that gets a 32-bits value in one cycle.
Can we do something ? Sure we can ! What about putting some critical functions in BRAM ? To do
that, we can change our linker script as follows (result in [FIRMWARE/spiflash2.ld](FIRMWARE/spiflash2.ld)):

```
    .data_and_fastcode : AT ( _sidata ) {
        . = ALIGN(4);
        _sdata = .;

	/* Initialized data */
        *(.data*)
        *(.sdata*)

	/* integer mul and div */
	*/libgcc.a:muldi3.o(.text)
	*/libgcc.a:div.o(.text)

	putchar.o(.text)
	print.o(.text)

	/* functions with attribute((section(".fastcode"))) */
	*(.fastcode*)

        . = ALIGN(4);
        _edata = .;
    } > RAM
```

By doing so, we indicate that some specific functions (integer multiply and
divide from libgcc and IO functions) should be put in fast RAM, and that's
all we have to do ! The linker will put the code for these functions in the
same section as the initialization data for initialized variables, and
our runtime `start_spiflash1.S` will copies them with the initialization data
to RAM at startup, cool !

Let us try it with our example:

```
  $ cd FIRMWARE
  $ make mandel_C.spiflash2.prog
  $ cd ..
  $ ./terminal.sh
```

Aaaah, much better !

Note also the line `*(.fastcode*)`: you can put your own functions in BRAM, by
indicating that they are in a `fastcode` section. In C, you can do that as
follows:

```C
 void my_function(my args ...) __attribute((section(".fastcode")));
 void my_function(my args ...) {
      ...
 }
```

**Try this** run the `ST_NICCC` demo (`make ST_NICCC.spiflash2.prog`). Then uncomment
the line in `ST_NICCC.c` with the definition for `RV32_FASTCODE` and re-run it.

![](tinyraytracer_tty.png)

Now we can run larger programs on our device:
- [FIRMWARE/pi.c](FIRMWARE/pi.c) (by Fabrice Beillard, computes the decimals of pi)
- [FIRMWARE/tinyraytracer.c](FIRMWARE/tinyraytracer.c) (by Dmitry Sokolov, raytracing)

Both of them use floating point numbers. For a RV32I core such as ours, floating point numbers use
routines implemented in `libgcc`. As a consequence, executables are larger (`pi` weights 17 kB and
`tinyraytracer` weights 25 kB) and would have been impossible to run in 6 kB of RAM. The additional
memory offered by the SPI FLASH offers much more possibilities to our device !

At this point, not only our device runs code compiled using standard tools (gcc), but also it runs
existing code, independently developped (the mathematical routines in `libgcc`). It is quite exciting
to run existing binary code on a processor that you create on your own !

## Next tutorial

[Pipelining](PIPELINE.md)

## Files for all the steps

- [step 1](step1.v): Blinker, too fast, can't see anything
- [step 2](step2.v): Blinker with clockworks
- [step 3](step3.v): Blinker that loads pattern from ROM
- [step 4](step4.v): The instruction decoder
- [step 5](step5.v): The register bank and the state machine
- [step 6](step6.v): The ALU
- [step 7](step7.v): Using the VERILOG assembler
- [step 8](step8.v): Jumps
- [step 9](step9.v): Branches
- [step 10](step10.v): LUI and AUIPC
- [step 11](step11.v): Memory in separate module
- [step 12](step12.v): Size optimization: the Incredible Shrinking Core !
- [step 13](step13.v): Subroutines 1 (standard Risc-V instruction set)
- [step 14](step14.v): Subroutines 2 (using Risc-V pseudo-instructions)
- [step 15](step15.v): Load
- [step 16](step16.v): Store
- [step 17](step17.v): Memory-mapped devices
- [step 18](step18.v): Mandelbrot set
- step 19: Faster simulation with Verilator
- [step 20](step20.v): Using the GNU toolchain to compile assembly programs
- step 21: Using the GNU toolchain to compile C programs
- [step 22](step22.v): More memory ! Using the SPI Flash
- [step 23](step23.v): Running programs from the SPI Flash, first steps
- [step 24](step24.v): Running programs from the SPI Flash, better linker script

_WIP_

- step 25: More devices (LED matrix, OLED screen...)