Tiny CPU
Building a RISC-V CPU from scratch in Verilog
This is a writeup of my RISC-V processor implementation in Verilog. The goal was to build a working CPU that could execute real programs and communicate with the outside world. This is largely adapted from the excellent learn-fpga repo. If you want to build your own RISC-V CPU, I’d suggest taking a look!
Foundation: Clock management
Hardware runs on clocks, but FPGA clocks run at MHz speeds: far too fast to observe LEDs blinking or debug with your eyes. The first step is a clock divider.
module Clockworks #(
parameter SLOW = 0
)(
input CLK,
input RESET,
output clk,
output resetn
);
reg [SLOW:0] slow_CLK = 0;
always @(posedge CLK) begin
slow_CLK <= slow_CLK + 1;
end
assign clk = slow_CLK[SLOW];
assign resetn = 1; // RESET is ignored in this minimal version
endmodule
The SLOW parameter determines the division factor. Because clk is bit SLOW of a free-running counter, it runs at $1/2^{\text{SLOW}+1}$ of the input frequency: with SLOW = 16, that is $1/2^{17}$. For simulation, you can set SLOW = 0 so the clock is only halved, but then your terminal gets spammed.
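If you want SLOW = 0 to mean a truly undivided clock, one option is to bypass the counter with a generate block. This is a minimal sketch, not the module used in this project: the name ClockworksBypass and the block labels divided and direct are hypothetical.

```verilog
// Hypothetical variant of Clockworks: SLOW = 0 bypasses the divider.
module ClockworksBypass #(
   parameter SLOW = 0
)(
   input  CLK,
   input  RESET,   // unused in this sketch, kept for port compatibility
   output clk,
   output resetn
);
   generate
      if (SLOW != 0) begin : divided
         // Free-running counter; bit SLOW toggles every 2^SLOW cycles,
         // so clk runs at CLK / 2^(SLOW+1).
         reg [SLOW:0] slow_CLK = 0;
         always @(posedge CLK) slow_CLK <= slow_CLK + 1;
         assign clk = slow_CLK[SLOW];
      end else begin : direct
         // No division: pass the input clock straight through.
         assign clk = CLK;
      end
   endgenerate

   assign resetn = 1'b1;
endmodule
```

The ports match the original module, so it can be swapped in without touching the rest of the design.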
Instruction decoding
RISC-V’s instruction encoding is elegant: every instruction in the base ISA is exactly 32 bits, and different instruction types follow consistent patterns. The opcode (bits 6:0) determines the instruction class:
wire [6:0] opcode = instr[6:0];
wire isLUI = (opcode == 7'b0110111);
wire isAUIPC = (opcode == 7'b0010111);
wire isJAL = (opcode == 7'b1101111);
wire isJALR = (opcode == 7'b1100111);
wire isBranch = (opcode == 7'b1100011);
wire isLoad = (opcode == 7'b0000011);
wire isStore = (opcode == 7'b0100011);
wire isALUimm = (opcode == 7'b0010011);
wire isALUreg = (opcode == 7'b0110011);
wire isSYSTEM = (opcode == 7'b1110011);
Register fields are consistent across instruction types:
- rs1 (source register 1): bits 19:15
- rs2 (source register 2): bits 24:20
- rd (destination register): bits 11:7
- funct3 (function modifier): bits 14:12
- funct7 (function modifier): bits 31:25
wire [4:0] rs1 = instr[19:15];
wire [4:0] rs2 = instr[24:20];
wire [4:0] rd = instr[11:7];
wire [2:0] funct3 = instr[14:12];
wire [6:0] funct7 = instr[31:25];
Immediate values are more complex: they’re scattered across different bit positions depending on the instruction type, and they need sign extension:
// I-type: 12-bit immediate
wire [31:0] Iimm = {{21{instr[31]}}, instr[30:20]};
// U-type: 20-bit immediate, shifted left 12 bits
wire [31:0] Uimm = {instr[31:12], 12'b0};
// S-type: 12-bit immediate for stores
wire [31:0] Simm = {{21{instr[31]}}, instr[30:25], instr[11:7]};
// B-type: 13-bit immediate for branches (LSB always 0)
wire [31:0] Bimm = {{20{instr[31]}}, instr[7], instr[30:25], instr[11:8], 1'b0};
// J-type: 21-bit immediate for jumps (LSB always 0)
wire [31:0] Jimm = {{12{instr[31]}}, instr[19:12], instr[20], instr[30:21], 1'b0};
The sign extension ({21{instr[31]}}) replicates the sign bit (bit 31 of the instruction) into the upper bits, so a negative immediate keeps its value in two’s complement when it is widened to 32 bits.
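To check the bit shuffling, here is a small simulation-only sketch that decodes two hand-assembled instructions with the same Iimm and Bimm expressions. The module name imm_decode_tb and the two test encodings are my own additions for illustration.

```verilog
// Testbench sketch (simulation only): decode immediates from two
// hand-assembled instructions. Module name is hypothetical.
module imm_decode_tb;
   reg [31:0] instr;

   // Same expressions as in the decoder above
   wire [31:0] Iimm = {{21{instr[31]}}, instr[30:20]};
   wire [31:0] Bimm = {{20{instr[31]}}, instr[7], instr[30:25], instr[11:8], 1'b0};

   initial begin
      instr = 32'hFFF00093;                       // addi x1, x0, -1
      #1 $display("Iimm = %0d", $signed(Iimm));   // Iimm = -1

      instr = 32'hFE000EE3;                       // beq x0, x0, -4
      #1 $display("Bimm = %0d", $signed(Bimm));   // Bimm = -4

      $finish;
   end
endmodule
```

In both cases bit 31 is set, so the upper bits of the immediate fill with ones and the 32-bit result reads as a small negative number.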
The ALU
The ALU handles arithmetic and logic operations. The funct3 field specifies which operation, and for some instructions (like SUB vs ADD), funct7 provides additional disambiguation.
wire [31:0] aluIn1 = rs1val;
wire [31:0] aluIn2 = (isALUreg | isBranch) ? rs2val : Iimm;
// Compute both addition and subtraction
wire [31:0] aluPlus = aluIn1 + aluIn2;
wire [32:0] aluMinus = {1'b0, aluIn1} - {1'b0, aluIn2};
// Comparison operations
wire EQ = (aluMinus[31:0] == 0);
wire LTU = aluMinus[32]; // Unsigned less than (borrow out of the 32-bit subtraction)
wire LT = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];
For comparisons, I use a 33-bit subtraction. Isn’t it convenient that we’re not actually limited to 32-bit chips? The extra bit (bit 32) is the borrow: it is set exactly when aluIn1 < aluIn2 as unsigned numbers. For signed comparisons, if the signs differ, the negative operand is the smaller one; otherwise, the borrow bit gives the answer.
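A quick way to convince yourself of the signed/unsigned distinction is to feed the comparison logic a pair of operands that compare differently under each interpretation. The sketch below reuses the expressions above; the module name compare_tb and the operand values are chosen for illustration.

```verilog
// Testbench sketch (simulation only): signed vs. unsigned comparison.
// Module name is hypothetical.
module compare_tb;
   reg [31:0] aluIn1, aluIn2;

   wire [32:0] aluMinus = {1'b0, aluIn1} - {1'b0, aluIn2};
   wire EQ  = (aluMinus[31:0] == 0);
   wire LTU = aluMinus[32];
   wire LT  = (aluIn1[31] ^ aluIn2[31]) ? aluIn1[31] : aluMinus[32];

   initial begin
      // 0xFFFFFFFF is -1 when signed, but the largest value when unsigned
      aluIn1 = 32'hFFFFFFFF; aluIn2 = 32'h00000001;
      #1 $display("EQ=%b LTU=%b LT=%b", EQ, LTU, LT);  // EQ=0 LTU=0 LT=1

      // 1 < 2 under both interpretations
      aluIn1 = 32'h00000001; aluIn2 = 32'h00000002;
      #1 $display("EQ=%b LTU=%b LT=%b", EQ, LTU, LT);  // EQ=0 LTU=1 LT=1

      $finish;
   end
endmodule
```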
Shifts are trickier. RISC-V has left shift (SLL), logical right shift (SRL), and arithmetic right shift (SRA). To implement left shift using right shift hardware, you reverse the bits, shift right, then reverse again:
function [31:0] flip32;
input [31:0] x;
integer i;
begin
// Bit reversal: output bit i takes input bit 31-i
for (i = 0; i < 32; i = i + 1)
flip32[i] = x[31 - i];
end
endfunction
wire [31:0] shifter_in = (funct3 == 3'b001) ? flip32(aluIn1) : aluIn1;
wire [31:0] shifter = $signed({instr[30] & aluIn1[31], shifter_in}) >>> aluIn2[4:0];
wire [31:0] leftshift = flip32(shifter);
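The flip, shift, flip trick is easy to get subtly wrong, so here is a standalone sketch that exercises all three shift types. It assumes a loop-based flip32, and the inputs isLeft and isArith are hypothetical stand-ins for (funct3 == 3'b001) and instr[30]; shamt stands in for aluIn2[4:0].

```verilog
// Testbench sketch (simulation only) for the shared shifter.
// Module name and the isLeft/isArith/shamt inputs are hypothetical.
module shifter_tb;
   reg [31:0] aluIn1;
   reg [4:0]  shamt;    // stand-in for aluIn2[4:0]
   reg        isLeft;   // stand-in for (funct3 == 3'b001)
   reg        isArith;  // stand-in for instr[30]

   function [31:0] flip32;
      input [31:0] x;
      integer i;
      begin
         for (i = 0; i < 32; i = i + 1)
            flip32[i] = x[31 - i];
      end
   endfunction

   wire [31:0] shifter_in = isLeft ? flip32(aluIn1) : aluIn1;
   // 33-bit signed value: the extra top bit is the fill bit for SRA
   wire [31:0] shifter    = $signed({isArith & aluIn1[31], shifter_in}) >>> shamt;
   wire [31:0] leftshift  = flip32(shifter);

   initial begin
      aluIn1 = 32'h00000001; shamt = 4; isLeft = 1; isArith = 0;
      #1 $display("SLL: %h", leftshift);   // SLL: 00000010

      aluIn1 = 32'h80000000; shamt = 4; isLeft = 0; isArith = 0;
      #1 $display("SRL: %h", shifter);     // SRL: 08000000

      isArith = 1;
      #1 $display("SRA: %h", shifter);     // SRA: f8000000

      $finish;
   end
endmodule
```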
The ALU output multiplexer:
reg [31:0] aluOut;
always @(*) begin
case (funct3)
3'b000: aluOut = (funct7[5] & instr[5]) ? aluMinus[31:0] : aluPlus;
3'b001: aluOut = leftshift;
3'b010: aluOut = {31'b0, LT};
3'b011: aluOut = {31'b0, LTU};
3'b100: aluOut = (aluIn1 ^ aluIn2);
3'b101: aluOut = shifter;
3'b110: aluOut = (aluIn1 | aluIn2);
3'b111: aluOut = (aluIn1 & aluIn2);
endcase
end
Branch logic
Branches use the ALU’s comparison results to decide whether to jump:
reg takeBranch;
always @(*) begin
case (funct3)
3'b000: takeBranch = EQ; // BEQ
3'b001: takeBranch = !EQ; // BNE
3'b100: takeBranch = LT; // BLT
3'b101: takeBranch = !LT; // BGE
3'b110: takeBranch = LTU; // BLTU
3'b111: takeBranch = !LTU; // BGEU
default: takeBranch = 1'b0;
endcase
end
The next program counter logic handles all control flow:
wire [31:0] PCplusImm = PC + (isJAL ? Jimm : isAUIPC ? Uimm : Bimm);
wire [31:0] PCplus4 = PC + 4;
wire [31:0] nextPC = ((isBranch && takeBranch) || isJAL) ? PCplusImm :
isJALR ? {aluPlus[31:1], 1'b0} :
PCplus4;
JALR is special: it computes the target address using the ALU (rs1 + immediate), then clears the LSB to ensure alignment.
Memory operations
Load and store instructions access memory, but they can operate on bytes (8-bit), halfwords (16-bit), or words (32-bit). The processor needs to handle alignment and sign extension.
The load/store address is computed as rs1 + immediate, using the S-type immediate for stores and the I-type immediate for loads:
wire [31:0] loadstore_addr = rs1val + (isStore ? Simm : Iimm);
For loads, extract the appropriate byte or halfword based on the lower address bits:
wire [15:0] LOAD_halfword = loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];
wire [7:0] LOAD_byte = loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];
wire mem_byteAccess = funct3[1:0] == 2'b00;
wire mem_halfwordAccess = funct3[1:0] == 2'b01;
wire LOAD_sign = !funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);
wire [31:0] LOAD_data = mem_byteAccess ? {{24{LOAD_sign}}, LOAD_byte} :
mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
mem_rdata;
If funct3[2] is 0, it’s a signed load (LB, LH). If it’s 1, it’s unsigned (LBU, LHU). Sign extension replicates the sign bit to fill the upper bits.
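As a worked example, the sketch below runs the load logic on a single memory word and shows how the low address bits and funct3 change the result. The module name load_tb and the test values are my own, for illustration only.

```verilog
// Testbench sketch (simulation only) for byte/halfword load extraction.
// Module name is hypothetical.
module load_tb;
   reg [31:0] mem_rdata;
   reg [31:0] loadstore_addr;
   reg [2:0]  funct3;

   wire [15:0] LOAD_halfword = loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];
   wire [7:0]  LOAD_byte     = loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];
   wire mem_byteAccess     = funct3[1:0] == 2'b00;
   wire mem_halfwordAccess = funct3[1:0] == 2'b01;
   wire LOAD_sign = !funct3[2] & (mem_byteAccess ? LOAD_byte[7] : LOAD_halfword[15]);
   wire [31:0] LOAD_data = mem_byteAccess     ? {{24{LOAD_sign}}, LOAD_byte} :
                           mem_halfwordAccess ? {{16{LOAD_sign}}, LOAD_halfword} :
                           mem_rdata;

   initial begin
      mem_rdata = 32'hDEADBEEF;

      loadstore_addr = 32'h00000003;          // top byte of the word (0xDE)
      funct3 = 3'b000;                        // LB: sign-extended
      #1 $display("LB  -> %h", LOAD_data);    // LB  -> ffffffde
      funct3 = 3'b100;                        // LBU: zero-extended
      #1 $display("LBU -> %h", LOAD_data);    // LBU -> 000000de

      loadstore_addr = 32'h00000002;          // upper halfword (0xDEAD)
      funct3 = 3'b001;                        // LH: sign-extended
      #1 $display("LH  -> %h", LOAD_data);    // LH  -> ffffdead

      $finish;
   end
endmodule
```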
Stores need to generate a write mask indicating which bytes to write:
wire [3:0] STORE_mask = mem_byteAccess ?
(loadstore_addr[1] ?
(loadstore_addr[0] ? 4'b1000 : 4'b0100) :
(loadstore_addr[0] ? 4'b0010 : 4'b0001)
) :
mem_halfwordAccess ?
(loadstore_addr[1] ? 4'b1100 : 4'b0011) : 4'b1111;
The data needs to be replicated to the correct byte lanes:
assign mem_wdata[7:0] = rs2val[7:0];
assign mem_wdata[15:8] = loadstore_addr[0] ? rs2val[7:0] : rs2val[15:8];
assign mem_wdata[23:16] = loadstore_addr[1] ? rs2val[7:0] : rs2val[23:16];
assign mem_wdata[31:24] = loadstore_addr[0] ? rs2val[7:0] :
loadstore_addr[1] ? rs2val[15:8] : rs2val[31:24];
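The mask and the lane replication have to agree: every byte lane the mask enables must carry the right byte of rs2val. The sketch below checks that for a byte store and a halfword store to offset 2. It assumes lane [23:16] is steered by loadstore_addr[1], since that lane is only written when bit 1 of the address is set (masks 0100 and 1100). The module name store_tb and the test values are hypothetical.

```verilog
// Testbench sketch (simulation only) for store mask and data lanes.
// Module name is hypothetical.
module store_tb;
   reg [31:0] rs2val;
   reg [31:0] loadstore_addr;
   reg [2:0]  funct3;

   wire mem_byteAccess     = funct3[1:0] == 2'b00;
   wire mem_halfwordAccess = funct3[1:0] == 2'b01;

   wire [3:0] STORE_mask = mem_byteAccess ?
                              (loadstore_addr[1] ?
                                 (loadstore_addr[0] ? 4'b1000 : 4'b0100) :
                                 (loadstore_addr[0] ? 4'b0010 : 4'b0001)) :
                           mem_halfwordAccess ?
                              (loadstore_addr[1] ? 4'b1100 : 4'b0011) : 4'b1111;

   wire [31:0] mem_wdata;
   assign mem_wdata[7:0]   = rs2val[7:0];
   assign mem_wdata[15:8]  = loadstore_addr[0] ? rs2val[7:0] : rs2val[15:8];
   // Assumption: this lane is selected by address bit 1, not bit 0
   assign mem_wdata[23:16] = loadstore_addr[1] ? rs2val[7:0] : rs2val[23:16];
   assign mem_wdata[31:24] = loadstore_addr[0] ? rs2val[7:0] :
                             loadstore_addr[1] ? rs2val[15:8] : rs2val[31:24];

   initial begin
      rs2val = 32'h11223344;
      loadstore_addr = 32'h00000002;

      funct3 = 3'b000;   // SB: only lane 2 is written, and it carries 0x44
      #1 $display("SB mask=%b wdata=%h", STORE_mask, mem_wdata);  // SB mask=0100 wdata=33443344

      funct3 = 3'b001;   // SH: lanes 3 and 2 are written, carrying 0x3344
      #1 $display("SH mask=%b wdata=%h", STORE_mask, mem_wdata);  // SH mask=1100 wdata=33443344

      $finish;
   end
endmodule
```

The lanes the mask leaves disabled hold don't-care values, which is why the same wdata works for both stores.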
The register file
RISC-V has 32 general-purpose registers. Register 0 (x0) is hardwired to zero: writes to it are discarded, and reads always return 0.
reg [31:0] RegisterBank [0:31];
reg [31:0] rs1val;
reg [31:0] rs2val;
// Register reads happen in the ID stage
always @(posedge clk) begin
rs1val <= RegisterBank[rs1];
rs2val <= RegisterBank[rs2];
end
// Register writes happen when writeBackEn is high
always @(posedge clk) begin
if (writeBackEn && rd != 0) begin
RegisterBank[rd] <= writeBackData;
end
end
The writeback data can come from several sources:
assign writeBackData = (isJAL || isJALR) ? PCplus4 : // Return address
isLUI ? Uimm : // Upper immediate
isAUIPC ? PCplusImm : // PC + upper immediate
isLoad ? LOAD_data : // Memory load
aluOut; // ALU result
The state machine
The processor uses a multi-cycle design with distinct stages:
localparam IF = 0; // Instruction Fetch
localparam WAIT_IF = 1; // Wait for instruction memory
localparam ID = 2; // Instruction Decode
localparam EX = 3; // Execute
localparam LOAD = 4; // Memory load
localparam WAIT_DATA = 5; // Wait for data memory
localparam STORE = 6; // Memory store
reg [2:0] state = IF;
The state machine:
always @(posedge clk) begin
case (state)
IF: begin
state <= WAIT_IF;
end
WAIT_IF: begin
instr <= mem_rdata;
state <= ID;
end
ID: begin
rs1val <= RegisterBank[rs1];
rs2val <= RegisterBank[rs2];
state <= EX;
end
EX: begin
PC <= nextPC;
state <= isLoad ? LOAD : (isStore ? STORE : IF);
end
LOAD: begin
state <= WAIT_DATA;
end
WAIT_DATA: begin
state <= IF;
end
STORE: begin
state <= IF;
end
endcase
end
Memory-mapped I/O
Instead of special I/O instructions, memory-mapped I/O uses specific memory addresses to access devices. I use bit 22 of the address to distinguish between RAM and I/O space:
wire [29:0] mem_wordaddr = mem_addr[31:2];
wire isIO = mem_addr[22];
wire isRAM = !isIO;
This means addresses below 0x400000 are RAM, and addresses from 0x400000 up (anything with bit 22 set) are I/O devices. Within I/O space, I use one bit of the word address per device. Since mem_wordaddr is mem_addr[31:2], word-address bit n corresponds to byte-address bit n + 2:
localparam IO_LEDS_bit = 0; // Byte address 0x400004
localparam IO_UART_DATA_bit = 1; // Byte address 0x400008
localparam IO_UART_CTRL_bit = 2; // Byte address 0x400010
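Because the device-select bits index the word address (mem_addr[31:2]), the byte address a program has to use is the I/O base with bit (n + 2) set. The sketch below prints those byte addresses; the module name io_addr_tb and the helper function io_addr are hypothetical, not part of the processor.

```verilog
// Sketch (simulation only): byte addresses of the memory-mapped devices.
// Module name and the io_addr helper are hypothetical.
module io_addr_tb;
   localparam IO_LEDS_bit      = 0;
   localparam IO_UART_DATA_bit = 1;
   localparam IO_UART_CTRL_bit = 2;

   // Byte address inside I/O space whose word address has bit b set.
   // The +2 accounts for mem_wordaddr being mem_addr[31:2].
   function [31:0] io_addr;
      input [4:0] b;
      begin
         io_addr = 32'h00400000 | (32'd1 << (b + 2));
      end
   endfunction

   initial begin
      $display("LEDS      %h", io_addr(IO_LEDS_bit));       // LEDS      00400004
      $display("UART_DATA %h", io_addr(IO_UART_DATA_bit));  // UART_DATA 00400008
      $display("UART_CTRL %h", io_addr(IO_UART_CTRL_bit));  // UART_CTRL 00400010
      $finish;
   end
endmodule
```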
LED control is straightforward: writing to the LED address updates the LED register:
always @(posedge clk) begin
if (isIO & mem_wstrb & mem_wordaddr[IO_LEDS_bit]) begin
LEDS <= mem_wdata;
end
end
UART communication
UART (Universal Asynchronous Receiver/Transmitter) provides serial communication. I integrated a simple transmitter module:
wire uart_valid = isIO & mem_wstrb & mem_wordaddr[IO_UART_DATA_bit];
wire uart_ready;
corescore_emitter_uart #(
.clk_freq_hz(BOARD_FREQ * 1000000),
.baud_rate(115200)
) UART (
.i_clk(clk),
.i_rst(!resetn),
.i_data(mem_wdata[7:0]),
.i_valid(uart_valid),
.o_ready(uart_ready),
.o_uart_tx(TXD)
);
The UART has a ready signal indicating whether it can accept new data. Reading from the UART control address returns the busy status:
wire [31:0] IO_rdata = mem_wordaddr[IO_UART_CTRL_bit] ?
{22'b0, !uart_ready, 9'b0} : 32'b0;
Hello world and debugging
With UART working, I wrote a program to print “Hello!” via the serial port. The assembly (yes, literally written using Verilog macros, I really need to get a C compiler going) stores the string in memory and iterates through each character:
Label(hello_str_);
DATAB(8'd72, 8'd101, 8'd108, 8'd108); // "Hell"
DATAB(8'd111, 8'd33, 8'd10, 0); // "o!\n\0"
Label(init_);
LI(gp, 32'h400000); // I/O base address
LI(s0, hello_str_); // String pointer
Label(loop_);
LBU(a0, s0, 0); // Load byte from string
BEQZ(a0, LabelRef(exit_)); // If null, exit
CALL(LabelRef(putc_)); // Print character
ADDI(s0, s0, 1); // Next character
J(LabelRef(loop_));
Label(putc_);
SW(a0, gp, IO_UART_DATA); // Write to UART data
LI(t0, 1 << 9); // Busy bit mask
Label(wait_uart_);
LW(t1, gp, IO_UART_CTRL); // Read UART status from the control register
AND(t1, t1, t0); // Check busy bit
BNEZ(t1, LabelRef(wait_uart_)); // Wait if busy
RET();
But it didn’t work. The processor would hang on the first load instruction.
After simulation debugging, I found the issue: the load logic was incorrectly extracting bytes and halfwords from memory. The problem was in how I was using the address bits to select the correct byte.
The buggy code was using mem_wordaddr (which is mem_addr[31:2]) to index into the byte selection logic. This was wrong. I needed to use the full loadstore_addr, including the lower 2 bits, to determine which byte or halfword to extract.
The fix:
// Use the full load/store address, not the word-aligned version
wire [15:0] LOAD_halfword = loadstore_addr[1] ? mem_rdata[31:16] : mem_rdata[15:0];
wire [7:0] LOAD_byte = loadstore_addr[0] ? LOAD_halfword[15:8] : LOAD_halfword[7:0];
This correctly uses bits [1:0] of the address to select the proper byte or halfword. With this fix, the load instructions worked, and “Hello!” appeared on the serial terminal.
What’s next?
This processor implements most of the RV32I base instruction set and can run non-trivial programs. But there’s a lot I want to do:
- Pipelining is the obvious next step and pretty well-covered in the textbooks. Hopefully branch prediction and data hazards aren’t too much of a snag.
- This runs in simulation. But an FPGA is in the mail as I write this so hopefully I’ll get some nice blinking lights to show you all.
- I’m writing assembly by hand. Integrating with the RISC-V GCC toolchain would let me write C programs and run them on the processor.
- Currently, the processor polls I/O devices. Interrupt support would let it respond to events asynchronously.
If you’re interested in computer architecture, I recommend working through this excellent From Blinker to RISC-V tutorial. The RISC-V spec is also surprisingly readable.
The code is available at github.com/n-milo/tiny-cpu.