Tiny CPU

Building a RISC-V CPU from scratch in Verilog

May 17, 2024 • 10 minute read • project, riscv, verilog

This is a writeup of my RISC-V processor implementation in Verilog. The goal was to build a working CPU that could execute real programs and communicate with the outside world. This is largely adapted from the excellent learn-fpga repo. If you want to build your own RISC-V CPU, I’d suggest taking a look!

Foundation: Clock management

Hardware runs on clocks, but FPGA clocks run at MHz speeds: far too fast to observe LEDs blinking or debug with your eyes. The first step is a clock divider.

module Clockworks #(
    parameter SLOW = 0
)(
    input  CLK,
    input  RESET,
    output clk,
    output resetn
);
    reg [SLOW:0] slow_CLK = 0;
    always @(posedge CLK) begin
        slow_CLK <= slow_CLK + 1;
    end

    assign clk = slow_CLK[SLOW];
    assign resetn = 1;
endmodule

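The SLOW parameter determines the division factor. The output is bit SLOW of a free-running counter, and bit $n$ of a counter has a period of $2^{n+1}$ input cycles, so the output clock runs at $1/2^{\text{SLOW}+1}$ the speed of the input. If SLOW = 16, that is $1/2^{17}$. For simulation, you can set SLOW = 0, which is as fast as this module goes (as written it still halves the clock), but then your terminal gets spammed.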
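To sanity-check the division factor, here is a small simulation sketch. It copies the counter from Clockworks into a testbench and counts how many input cycles pass between two rising edges of the slow clock. The testbench name and the choice of SLOW = 3 are mine, picked only to keep the run short.

```verilog
// Testbench sketch (hypothetical name): measures one period of the divided clock.
module clock_divider_tb;
    localparam SLOW = 3;            // small value so the simulation ends quickly

    reg CLK = 0;
    always #1 CLK = ~CLK;           // free-running input clock

    // Same divider as in Clockworks
    reg [SLOW:0] slow_CLK = 0;
    always @(posedge CLK) slow_CLK <= slow_CLK + 1;
    wire clk = slow_CLK[SLOW];

    // Count rising edges of the input clock
    integer cycles = 0;
    always @(posedge CLK) cycles = cycles + 1;

    // Report the distance between two rising edges of the slow clock
    integer last = -1;
    always @(posedge clk) begin
        if (last >= 0) begin
            $display("one clk period = %0d CLK cycles", cycles - last);
            $finish;
        end
        last = cycles;
    end
endmodule
```

With SLOW = 3 the reported period should be 16 input cycles, since bit 3 of the counter only completes a full cycle every 2^4 increments.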

Instruction decoding

RISC-V’s instruction encoding is elegant: every instruction is exactly 32 bits, and different instruction types follow consistent patterns. The opcode (bits 6:0) determines the instruction class:

wire [6:0] opcode = instr[6:0];
wire isLUI    = (opcode == 7'b0110111);
wire isAUIPC  = (opcode == 7'b0010111);
wire isJAL    = (opcode == 7'b1101111);
wire isJALR   = (opcode == 7'b1100111);
wire isBranch = (opcode == 7'b1100011);
wire isLoad   = (opcode == 7'b0000011);
wire isStore  = (opcode == 7'b0100011);
wire isALUimm = (opcode == 7'b0010011);
wire isALUreg = (opcode == 7'b0110011);
wire isSYSTEM = (opcode == 7'b1110011);

The register and function fields sit at the same bit positions in every format that has them:

rs1 (source register 1): bits 19:15
rs2 (source register 2): bits 24:20
rd (destination register): bits 11:7
funct3 (function modifier): bits 14:12
funct7 (function modifier): bits 31:25

wire [4:0] rs1 = instr[19:15];
wire [4:0] rs2 = instr[24:20];
wire [4:0] rd  = instr[11:7];
wire [2:0] funct3 = instr[14:12];
wire [6:0] funct7 = instr[31:25];

Immediate values are more complex: they’re scattered across different bit positions depending on the instruction type, and they need sign extension:

// I-type: 12-bit immediate
wire [31:0] Iimm = {{21{instr[31]}}, instr[30:20]};

// U-type: 20-bit immediate, shifted left 12 bits
wire [31:0] Uimm = {instr[31:12], 12'b0};

// S-type: 12-bit immediate for stores
wire [31:0] Simm = {{21{instr[31]}}, instr[30:25], instr[11:7]};

// B-type: 13-bit immediate for branches (LSB always 0)
wire [31:0] Bimm = {{20{instr[31]}}, instr[7], instr[30:25], instr[11:8], 1'b0};

// J-type: 21-bit immediate for jumps (LSB always 0)
wire [31:0] Jimm = {{12{instr[31]}}, instr[19:12], instr[20], instr[30:21], 1'b0};

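The sign extension ({21{instr[31]}}) replicates the sign bit to fill the upper bits. In every format the sign of the immediate lives in instr[31], and in two’s complement, copying the sign bit upward widens a value to 32 bits without changing it, so negative offsets stay negative.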
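A quick way to convince yourself the bit shuffling is right is to feed in a few known encodings. The sketch below reuses the I-, S-, and B-type expressions from above. The testbench name is hypothetical, and I assembled the three instruction words by hand from the RV32I formats.

```verilog
// Self-checking sketch (hypothetical testbench name): decodes three known encodings.
module imm_decode_tb;
    reg [31:0] instr;

    // Same expressions as in the decoder above
    wire [31:0] Iimm = {{21{instr[31]}}, instr[30:20]};
    wire [31:0] Simm = {{21{instr[31]}}, instr[30:25], instr[11:7]};
    wire [31:0] Bimm = {{20{instr[31]}}, instr[7], instr[30:25], instr[11:8], 1'b0};

    initial begin
        instr = 32'hFFF00093;   // addi x1, x0, -1
        #1 $display("Iimm = %0d", $signed(Iimm));   // -1

        instr = 32'hFE20AC23;   // sw x2, -8(x1)
        #1 $display("Simm = %0d", $signed(Simm));   // -8

        instr = 32'hFE000EE3;   // beq x0, x0, -4
        #1 $display("Bimm = %0d", $signed(Bimm));   // -4
        $finish;
    end
endmodule
```

All three immediates are negative, so each case exercises the sign extension as well as the field reassembly.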

The ALU

The ALU handles arithmetic and logic operations. The funct3 field specifies which operation, and for some instructions (like SUB vs ADD), funct7 provides additional disambiguation.

wire [31:0] aluIn1 = rs1val;
wire [31:0] aluIn2 = (isALUreg | isBranch) ? rs2val : Iimm;

// Compute both addition and subtraction
wire [31:0] aluPlus = aluIn1 + aluIn2;
wire [32:0] aluMinus = {1'b0, aluIn1} - {1'b0, aluIn2};

// Comparison operations
wire EQ  = (aluMinus[31:0] == 0);
wire LTU = aluMinus[32];  // Unsigned less than (carry bit)
wire LT  = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];

For comparisons, I use a 33-bit subtraction. Isn’t it convenient that we’re not actually limited to 32-bit chips? Both operands are zero-extended, so bit 32 of the result acts as the borrow: it is set exactly when aluIn1 < aluIn2 (unsigned). For signed comparisons, if the signs differ, the negative number is smaller; otherwise, the same bit gives the answer.
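Two operand pairs are enough to see the signed and unsigned results diverge. This is a minimal sketch with a hypothetical testbench name, using the same expressions as above.

```verilog
// Sketch (hypothetical testbench name): signed vs unsigned less-than.
module compare_tb;
    reg [31:0] aluIn1, aluIn2;

    wire [32:0] aluMinus = {1'b0, aluIn1} - {1'b0, aluIn2};
    wire LTU = aluMinus[32];
    wire LT  = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];

    initial begin
        // 0xFFFFFFFF is -1 when signed, 4294967295 when unsigned
        aluIn1 = 32'hFFFFFFFF; aluIn2 = 32'd1;
        #1 $display("LT=%b LTU=%b", LT, LTU);    // LT=1 LTU=0

        // Same sign: both results follow bit 32 of the subtraction
        aluIn1 = 32'd1; aluIn2 = 32'd2;
        #1 $display("LT=%b LTU=%b", LT, LTU);    // LT=1 LTU=1
        $finish;
    end
endmodule
```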

Shifts are trickier. RISC-V has left shift (SLL), logical right shift (SRL), and arithmetic right shift (SRA). To implement left shift using right shift hardware, you reverse the bits, shift right, then reverse again:

function [31:0] flip32;
    input [31:0] x;
    integer i;
    begin
        for (i = 0; i < 32; i = i + 1)
            flip32[i] = x[31 - i];  // Bit reversal
    end
endfunction

wire [31:0] shifter_in = (funct3 == 3'b001) ? flip32(aluIn1) : aluIn1;
wire [31:0] shifter = $signed({instr[30] & aluIn1[31], shifter_in}) >>> aluIn2[4:0];
wire [31:0] leftshift = flip32(shifter);
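Here is the flip-shift-flip trick exercised on concrete values. This is only a sketch: the testbench name is hypothetical, shamt stands in for aluIn2[4:0], arith stands in for instr[30], and flip32 is written as a loop.

```verilog
// Sketch (hypothetical testbench name): one right shifter serving SLL, SRL and SRA.
module shifter_tb;
    reg [31:0] aluIn1;
    reg [4:0]  shamt;    // stands in for aluIn2[4:0]
    reg [2:0]  funct3;
    reg        arith;    // stands in for instr[30]

    function [31:0] flip32;
        input [31:0] x;
        integer i;
        begin
            for (i = 0; i < 32; i = i + 1)
                flip32[i] = x[31 - i];
        end
    endfunction

    wire [31:0] shifter_in = (funct3 == 3'b001) ? flip32(aluIn1) : aluIn1;
    // The 33rd bit is the sign to shift in: 1 only for SRA of a negative value
    wire [31:0] shifter    = $signed({arith & aluIn1[31], shifter_in}) >>> shamt;
    wire [31:0] leftshift  = flip32(shifter);

    initial begin
        aluIn1 = 32'h00000001; shamt = 4; funct3 = 3'b001; arith = 0;  // SLL
        #1 $display("SLL: %h", leftshift);   // 00000010
        aluIn1 = 32'h80000000; funct3 = 3'b101; arith = 0;             // SRL
        #1 $display("SRL: %h", shifter);     // 08000000
        arith = 1;                                                     // SRA
        #1 $display("SRA: %h", shifter);     // f8000000
        $finish;
    end
endmodule
```

The extra top bit is what lets a single arithmetic right shift cover all three cases: it is 0 for SLL and SRL, so zeros shift in, and it copies the sign for SRA.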

The ALU output multiplexer:

reg [31:0] aluOut;
always @(*) begin
    case (funct3)
        3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
        3'b001: aluOut = leftshift;
        3'b010: aluOut = {31'b0, LT};
        3'b011: aluOut = {31'b0, LTU};
        3'b100: aluOut = (aluIn1 ^ aluIn2);
        3'b101: aluOut = shifter;
        3'b110: aluOut = (aluIn1 | aluIn2);
        3'b111: aluOut = (aluIn1 & aluIn2);
    endcase
end

Branch logic

Branches use the ALU’s comparison results to decide whether to jump:

reg takeBranch;
always @(*) begin
    case (funct3)
        3'b000: takeBranch = EQ;   // BEQ
        3'b001: takeBranch = !EQ;  // BNE
        3'b100: takeBranch = LT;   // BLT
        3'b101: takeBranch = !LT;  // BGE
        3'b110: takeBranch = LTU;  // BLTU
        3'b111: takeBranch = !LTU; // BGEU
        default: takeBranch = 1'b0;
    endcase
end

The next program counter logic handles all control flow:

wire [31:0] PCplusImm = PC + (isJAL ? Jimm : isAUIPC ? Uimm : Bimm);
wire [31:0] PCplus4 = PC + 4;

wire [31:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm :
                     isJALR ? {aluPlus[31:1], 1'b0} :
                     PCplus4;

JALR is special: it computes the target address using the ALU (rs1 + immediate), then clears the LSB to ensure alignment.
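A one-line check of the LSB clearing, as a sketch with a hypothetical testbench name and made-up operand values:

```verilog
// Sketch (hypothetical testbench name): JALR target with the low bit cleared.
module jalr_target_tb;
    reg  [31:0] rs1val, Iimm;
    wire [31:0] aluPlus = rs1val + Iimm;
    wire [31:0] target  = {aluPlus[31:1], 1'b0};

    initial begin
        rs1val = 32'h00001001; Iimm = 32'd2;   // odd base address plus offset
        #1 $display("sum=%h target=%h", aluPlus, target);  // sum=00001003 target=00001002
        $finish;
    end
endmodule
```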

Memory operations

Load and store instructions access memory, but they can operate on bytes (8-bit), halfwords (16-bit), or words (32-bit). The processor needs to handle alignment and sign extension.

The load/store address is computed as rs1 + immediate, using the S-type immediate for stores and the I-type immediate for loads:

wire [31:0] loadstore_addr = rs1val + (isStore ? Simm : Iimm);

For loads, extract the appropriate byte or halfword based on the lower address bits:

wire [15:0] LOAD_halfword = loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];
wire [7:0]  LOAD_byte = loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];

wire mem_byteAccess = funct3[1:0] == 2'b00;
wire mem_halfwordAccess = funct3[1:0] == 2'b01;

wire LOAD_sign = !funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);

wire [31:0] LOAD_data = mem_byteAccess ? {{24{LOAD_sign}}, LOAD_byte} :
                        mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
                        mem_rdata;

If funct3[2] is 0, it’s a signed load (LB, LH). If it’s 1, it’s unsigned (LBU, LHU). Sign extension replicates the sign bit to fill the upper bits.
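To see the lane selection and sign extension together, the sketch below loads from a single word holding 0xDEADBEEF. The testbench name is hypothetical, and addr_lo stands in for loadstore_addr[1:0].

```verilog
// Sketch (hypothetical testbench name): byte/halfword extraction from one memory word.
module load_extract_tb;
    reg [31:0] mem_rdata;
    reg [1:0]  addr_lo;   // stands in for loadstore_addr[1:0]
    reg [2:0]  funct3;

    wire [15:0] LOAD_halfword = addr_lo[1] ? mem_rdata[31:16] : mem_rdata[15:0];
    wire [7:0]  LOAD_byte     = addr_lo[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];
    wire mem_byteAccess     = funct3[1:0] == 2'b00;
    wire mem_halfwordAccess = funct3[1:0] == 2'b01;
    wire LOAD_sign = !funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);
    wire [31:0] LOAD_data = mem_byteAccess     ? {{24{LOAD_sign}}, LOAD_byte} :
                            mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
                            mem_rdata;

    initial begin
        mem_rdata = 32'hDEADBEEF;
        addr_lo = 2'b11; funct3 = 3'b000;     // LB from byte 3
        #1 $display("LB : %h", LOAD_data);    // ffffffde
        funct3 = 3'b100;                      // LBU from byte 3
        #1 $display("LBU: %h", LOAD_data);    // 000000de
        addr_lo = 2'b10; funct3 = 3'b001;     // LH from the upper halfword
        #1 $display("LH : %h", LOAD_data);    // ffffdead
        addr_lo = 2'b00; funct3 = 3'b010;     // LW
        #1 $display("LW : %h", LOAD_data);    // deadbeef
        $finish;
    end
endmodule
```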

Stores need to generate a write mask indicating which bytes to write:

wire [3:0] STORE_mask = mem_byteAccess ?
                            (loadstore_addr[1] ?
                                (loadstore_addr[0] ? 4'b1000 : 4'b0100) :
                                (loadstore_addr[0] ? 4'b0010 : 4'b0001)
                            ) :
                        mem_halfwordAccess ?
                            (loadstore_addr[1] ? 4'b1100 : 4'b0011) : 4'b1111;
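A few concrete cases make the mask easier to read. Again a sketch with a hypothetical testbench name, with addr_lo standing in for loadstore_addr[1:0]:

```verilog
// Sketch (hypothetical testbench name): write masks for SB, SH and SW.
module store_mask_tb;
    reg [1:0] addr_lo;   // stands in for loadstore_addr[1:0]
    reg [2:0] funct3;

    wire mem_byteAccess     = funct3[1:0] == 2'b00;
    wire mem_halfwordAccess = funct3[1:0] == 2'b01;
    wire [3:0] STORE_mask = mem_byteAccess ?
                                (addr_lo[1] ?
                                    (addr_lo[0] ? 4'b1000 : 4'b0100) :
                                    (addr_lo[0] ? 4'b0010 : 4'b0001)
                                ) :
                            mem_halfwordAccess ?
                                (addr_lo[1] ? 4'b1100 : 4'b0011) : 4'b1111;

    initial begin
        funct3 = 3'b000; addr_lo = 2'b10;     // SB to byte 2
        #1 $display("SB: %b", STORE_mask);    // 0100
        funct3 = 3'b001; addr_lo = 2'b10;     // SH to the upper halfword
        #1 $display("SH: %b", STORE_mask);    // 1100
        funct3 = 3'b010; addr_lo = 2'b00;     // SW
        #1 $display("SW: %b", STORE_mask);    // 1111
        $finish;
    end
endmodule
```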

The data needs to be replicated to the correct byte lanes. Only the lanes enabled by the mask are written, so each lane just has to carry the right byte for the cases where it is enabled:

assign mem_wdata[7:0]   = rs2val[7:0];
assign mem_wdata[15:8]  = loadstore_addr[0] ? rs2val[7:0]  : rs2val[15:8];
assign mem_wdata[23:16] = loadstore_addr[1] ? rs2val[7:0]  : rs2val[23:16];
assign mem_wdata[31:24] = loadstore_addr[0] ? rs2val[7:0]  :
                          loadstore_addr[1] ? rs2val[15:8] : rs2val[31:24];

The register file

RISC-V has 32 general-purpose registers. Register 0 (x0) is hardwired to zero: writes to it are discarded, and reads always return 0.

reg [31:0] RegisterBank [0:31];
reg [31:0] rs1val;
reg [31:0] rs2val;

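// Register reads happen in the ID stage. The same two assignments appear in
// the ID state of the state machine below; keep only one copy, since a reg
// should be driven from a single always block.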
always @(posedge clk) begin
    rs1val <= RegisterBank[rs1];
    rs2val <= RegisterBank[rs2];
end

// Register writes happen when writeBackEn is high
always @(posedge clk) begin
    if (writeBackEn && rd != 0) begin
        RegisterBank[rd] <= writeBackData;
    end
end

The writeback data can come from several sources:

assign writeBackData = (isJAL || isJALR) ? PCplus4 :  // Return address
                       isLUI ? Uimm :                  // Upper immediate
                       isAUIPC ? PCplusImm :           // PC + upper immediate
                       isLoad ? LOAD_data :            // Memory load
                       aluOut;                         // ALU result

The state machine

The processor uses a multi-cycle design with distinct stages:

localparam IF        = 0;  // Instruction Fetch
localparam WAIT_IF   = 1;  // Wait for instruction memory
localparam ID        = 2;  // Instruction Decode
localparam EX        = 3;  // Execute
localparam LOAD      = 4;  // Memory load
localparam WAIT_DATA = 5;  // Wait for data memory
localparam STORE     = 6;  // Memory store

reg [2:0] state = IF;

The state machine:

always @(posedge clk) begin
    case (state)
        IF: begin
            state <= WAIT_IF;
        end

        WAIT_IF: begin
            instr <= mem_rdata;
            state <= ID;
        end

        ID: begin
            rs1val <= RegisterBank[rs1];
            rs2val <= RegisterBank[rs2];
            state <= EX;
        end

        EX: begin
            PC <= nextPC;
            state <= isLoad ? LOAD : (isStore ? STORE : IF);
        end

        LOAD: begin
            state <= WAIT_DATA;
        end

        WAIT_DATA: begin
            state <= IF;
        end

        STORE: begin
            state <= IF;
        end
    endcase
end

Memory-mapped I/O

Instead of special I/O instructions, memory-mapped I/O uses specific memory addresses to access devices. I use bit 22 of the address to distinguish between RAM and I/O space:

wire [29:0] mem_wordaddr = mem_addr[31:2];
wire isIO = mem_addr[22];
wire isRAM = !isIO;

This means addresses below 0x400000 are RAM, and addresses with bit 22 set, starting at 0x400000, are I/O devices. Within I/O space, each device is selected by a single bit of the word address, so no address comparator is needed:

localparam IO_LEDS_bit = 0;       // Word-address bit 0: byte address 0x400004
localparam IO_UART_DATA_bit = 1;  // Word-address bit 1: byte address 0x400008
localparam IO_UART_CTRL_bit = 2;  // Word-address bit 2: byte address 0x400010
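Because mem_wordaddr drops the two low address bits, word-address bit n corresponds to byte offset 1 << (n + 2) from the I/O base. The sketch below checks that mapping. The testbench name is hypothetical, and the 0x400000 base is taken from the text above.

```verilog
// Sketch (hypothetical testbench name): one-hot device select from a byte address.
module io_decode_tb;
    localparam IO_LEDS_bit      = 0;
    localparam IO_UART_DATA_bit = 1;
    localparam IO_UART_CTRL_bit = 2;

    reg  [31:0] mem_addr;
    wire [29:0] mem_wordaddr = mem_addr[31:2];
    wire isIO = mem_addr[22];

    initial begin
        mem_addr = 32'h00400000 + (1 << (IO_LEDS_bit + 2));
        #1 $display("%h isIO=%b sel=%b", mem_addr, isIO, mem_wordaddr[2:0]);  // 00400004 isIO=1 sel=001
        mem_addr = 32'h00400000 + (1 << (IO_UART_DATA_bit + 2));
        #1 $display("%h isIO=%b sel=%b", mem_addr, isIO, mem_wordaddr[2:0]);  // 00400008 isIO=1 sel=010
        mem_addr = 32'h00400000 + (1 << (IO_UART_CTRL_bit + 2));
        #1 $display("%h isIO=%b sel=%b", mem_addr, isIO, mem_wordaddr[2:0]);  // 00400010 isIO=1 sel=100
        $finish;
    end
endmodule
```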

LED control is straightforward: writing to the LED address updates the LED register:

always @(posedge clk) begin
    if (isIO & mem_wstrb & mem_wordaddr[IO_LEDS_bit]) begin
        LEDS <= mem_wdata;
    end
end

UART communication

UART (Universal Asynchronous Receiver/Transmitter) provides serial communication. I integrated a simple transmitter module:

wire uart_valid = isIO & mem_wstrb & mem_wordaddr[IO_UART_DATA_bit];
wire uart_ready;

corescore_emitter_uart #(
    .clk_freq_hz(BOARD_FREQ * 1000000),
    .baud_rate(115200)
) UART (
    .i_clk(clk),
    .i_rst(!resetn),
    .i_data(mem_wdata[7:0]),
    .i_valid(uart_valid),
    .o_ready(uart_ready),
    .o_uart_tx(TXD)
);

The UART has a ready signal indicating whether it can accept new data. Reading from the UART control address returns the busy status (the inverse of ready) in bit 9:

wire [31:0] IO_rdata = mem_wordaddr[IO_UART_CTRL_bit] ?
                       {22'b0, !uart_ready, 9'b0} : 32'b0;

Hello world and debugging

With UART working, I wrote a program to print “Hello!” via the serial port. The assembly (yes, literally written using Verilog macros, I really need to get a C compiler going) stores the string in memory and iterates through each character:

Label(hello_str_);
    DATAB(8'd72, 8'd101, 8'd108, 8'd108);  // "Hell"
    DATAB(8'd111, 8'd33, 8'd10, 0);        // "o!\n\0"

Label(init_);
    LI(gp, 32'h400000);  // I/O base address
    LI(s0, hello_str_);  // String pointer

Label(loop_);
    LBU(a0, s0, 0);                    // Load byte from string
    BEQZ(a0, LabelRef(exit_));         // If null, exit
    CALL(LabelRef(putc_));             // Print character
    ADDI(s0, s0, 1);                   // Next character
    J(LabelRef(loop_));

Label(putc_);
    SW(a0, gp, IO_UART_DATA);          // Write to UART data
    LI(t0, 1 << 9);                    // Busy bit mask
Label(wait_uart_);
    LW(t1, gp, IO_UART_CTRL);          // Read UART status (control register)
    AND(t1, t1, t0);                   // Check busy bit
    BNEZ(t1, LabelRef(wait_uart_));    // Wait if busy
    RET();

But it didn’t work. The processor would hang on the first load instruction.

After simulation debugging, I found the issue: the load logic was incorrectly extracting bytes and halfwords from memory. The buggy code was using mem_wordaddr (which is mem_addr[31:2]) to drive the byte selection logic. This was wrong: the word address has already dropped the two low bits, and those are exactly the bits that say which byte or halfword within the word is wanted. I needed to use the full loadstore_addr, including the lower 2 bits, to determine which byte or halfword to extract.

The fix:

// Use the full load/store address, not the word-aligned version
wire [15:0] LOAD_halfword = loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];
wire [7:0]  LOAD_byte = loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];

This correctly uses bits [1:0] of the address to select the proper byte or halfword. With this fix, the load instructions worked, and “Hello!” appeared on the serial terminal.

What’s next?

This processor implements most of the RV32I base instruction set and can run non-trivial programs. But there’s a lot I want to do:

Pipelining is the obvious next step and pretty well-covered in the textbooks. Hopefully branch prediction and data hazards aren’t too much of a snag.
So far this only runs in simulation. But an FPGA is in the mail as I write this, so hopefully I’ll get some nice blinking lights to show you all.
I’m writing assembly by hand. Integrating with the RISC-V GCC toolchain would let me write C programs and run them on the processor.
Currently, the processor polls I/O devices. Interrupt support would let it respond to events asynchronously.

If you’re interested in computer architecture, I recommend working through this excellent From Blinker to RISC-V tutorial. The RISC-V spec is also surprisingly readable.

The code is available at github.com/n-milo/tiny-cpu.