This is a continuation of my previous post. In this post I cover the first iteration of the encryption portion of my Verilog implementation of the AES-256 algorithm. It focuses primarily on the implementation; my previous post covers the algorithm itself in more detail. I left off at the key expansion portion of the algorithm, and I pick up here with the encryption portion.

The AES algorithm operates on a 4x4 state matrix that contains the input data. There are two initial operations: splitting the input data into the state matrix and adding the initial round key. Next, there are 14 rounds of encryption that operate on the state matrix. Each round consists of a byte substitution, a row rotation, a column mixing operation, and the addition of the corresponding round key (the final round excludes the mix column operation). Lastly, the state matrix is re-assembled into a 128-bit register. Each of these operations works on the same state matrix and depends on the output of the previous operation. Instead of duplicating the state matrix for every logical operation, I decided to store one state matrix and create a state machine that gives each step consent to operate on it. At the time of writing, I don't know if this is a "good" design or not. It is, admittedly, a non-parallel, software-like design choice. If/when I learn more about design patterns in HDL, I may revisit this. I am happy to have a working encryption scheme and I'm not opposed to refactoring/refining when possible. I will describe the current design in more detail as I go through the module.

First, I modified the module declaration to include a reset signal, `reset`, a 128-bit data input register, `data_in`, and a 128-bit data output register, `data_out`.

```
module AES256
(
input clock, //Clock signal
input enable, //Enable signal
input reset, //Reset signal
input [255:0] key, //Input key
input [127:0] data_in, //Input data
output reg [127:0] data_out, //Output data
output wire valid //Valid signal
);
```

Next, I enumerated the various consent bits for each step of the process. I will describe how these are used below.

```
//Init key consent
localparam KEY_INIT = 12'h1;
//Key word consent
localparam KEY_WORDS = 12'h2;
//Key scheduling consent
localparam KEY_SCHED = 12'h4;
//Init matrix consent
localparam MATRIX_INIT = 12'h8;
//First round key consent
localparam ADD_R0_KEY = 12'h10;
//Sub bytes consent
localparam SUB_BYTE = 12'h20;
//Shift row consent
localparam SHIFT_ROW = 12'h40;
//Mix col consent
localparam MIX_COL = 12'h80;
//Add key consent
localparam ADD_KEY = 12'h100;
//Output data consent
localparam OUT_CONS = 12'h200;
//Validity consent
localparam OP_DONE = 12'h400;
```

I then added a parameter to hold the number of rounds; in the case of AES256, this value is 14. Below that, I added a register to count the rounds and a consent register to hold the current state of the consent state machine.

```
//Round count
localparam NUM_ROUNDS = 8'hE;
//Round counter
reg [7:0] round_counter;
//Matrix consent register
reg[10:0] consent_reg;
```

Next, I added a multidimensional array to represent the state matrix described above.

```
//State matrix
reg [7:0] matrix[0:3][0:3];
```

I added an `always` block to manage the state of the consent state machine. The state machine only gives consent to the other blocks if the operation isn't already finished. I make this check at the top of the block.

```
if(consent_reg != OP_DONE)
begin
```

If this check clears, I step through a case statement moving from one consent to the next in the order that the operations occur. (I will go into the implementation of each step later in this post.)

- `KEY_INIT` - Performs the initial split of the input key into words
- `KEY_WORDS` - Calculates the remaining words that make up the round keys
- `KEY_SCHED` - Concatenates the key words into the round keys and stores them for use
- `MATRIX_INIT` - Initializes the state matrix with the input data bytes
- `ADD_R0_KEY` - Adds the first round key to the state matrix

The variable `round_counter` is used to track the encryption rounds and acts as an index into the round key array. This value is incremented after the 0th round key is used in the `ADD_R0_KEY` state.

```
case(consent_reg)
0: consent_reg = KEY_INIT;
KEY_INIT: consent_reg = KEY_WORDS;
KEY_WORDS: consent_reg = KEY_SCHED;
KEY_SCHED: consent_reg = MATRIX_INIT;
MATRIX_INIT: consent_reg = ADD_R0_KEY;
ADD_R0_KEY:
begin
round_counter += 1;
consent_reg = SUB_BYTE;
end
SUB_BYTE: consent_reg = SHIFT_ROW;
```

The encryption rounds consist of the next four states:

- `SUB_BYTE` - Performs the byte substitution on the state matrix
- `SHIFT_ROW` - Performs the row rotations on the state matrix
- `MIX_COL` - Performs the column mixing operation on the state matrix
- `ADD_KEY` - Adds the current round key to the state matrix

In the `SHIFT_ROW` state, the transition depends on the encryption round. In the final round of encryption, the `MIX_COL` state is skipped.

```
SHIFT_ROW:
begin
if(round_counter < NUM_ROUNDS) //Only the final (14th) round skips MIX_COL
consent_reg = MIX_COL;
else
consent_reg = ADD_KEY;
end
MIX_COL: consent_reg = ADD_KEY;
```

Likewise, in the `ADD_KEY` state, the transition returns to the byte substitution step, `SUB_BYTE`, unless the 14 encryption rounds are complete, in which case it proceeds to the `OUT_CONS` state. In the `OUT_CONS` state, the state matrix is re-assembled into the final 128-bit field. The `OUT_CONS` state then transitions to the `OP_DONE` state. Each `ADD_KEY` step indicates the completion of one round, so `round_counter` is incremented.

```
ADD_KEY:
begin
if(round_counter < 14)
consent_reg = SUB_BYTE;
else
begin
consent_reg = OUT_CONS;
end
round_counter += 1;
end
OUT_CONS: consent_reg = OP_DONE;
endcase
end
end
```
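The transition order above can be sanity-checked in software. The following is a minimal Python sketch of the consent sequence, not the Verilog itself. The function name `consent_sequence` and the `IDLE` label for the all-zero register value are hypothetical, and the sketch assumes that only the final round skips `MIX_COL`.

```python
# Transitions that do not depend on the round counter.
LINEAR = {
    "IDLE": "KEY_INIT",
    "KEY_INIT": "KEY_WORDS",
    "KEY_WORDS": "KEY_SCHED",
    "KEY_SCHED": "MATRIX_INIT",
    "MATRIX_INIT": "ADD_R0_KEY",
    "SUB_BYTE": "SHIFT_ROW",
    "MIX_COL": "ADD_KEY",
    "OUT_CONS": "OP_DONE",
}


def consent_sequence(num_rounds=14):
    """Return every consent state entered on the way from IDLE to OP_DONE."""
    state, round_counter, visited = "IDLE", 0, []
    while state != "OP_DONE":
        if state == "ADD_R0_KEY":
            # Round key 0 has been used; start counting rounds at 1.
            round_counter += 1
            state = "SUB_BYTE"
        elif state == "SHIFT_ROW":
            # Assumption: only the final round skips MIX_COL.
            state = "MIX_COL" if round_counter < num_rounds else "ADD_KEY"
        elif state == "ADD_KEY":
            state = "SUB_BYTE" if round_counter < num_rounds else "OUT_CONS"
            round_counter += 1
        else:
            state = LINEAR[state]
        visited.append(state)
    return visited


seq = consent_sequence()
print(len(seq), seq.count("MIX_COL"), seq.count("ADD_KEY"))  # 62 13 14
```

Under these assumptions the model takes 62 transitions to reach `OP_DONE`, with 13 `MIX_COL` states and 14 `ADD_KEY` states, which is close to the 62 to 63 clock cycles the test bench reports further down.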

I covered the key scheduling portion of the algorithm in my previous post. The main difference here is that the operations are broken up into discrete `always` blocks that wait for consent from the state machine. The underlying implementation of these steps is the same as in that post. For example, the `KEY_INIT` state, which is responsible for breaking the key into words, checks for consent before operating on the key.

```
//Initial split of the key into words
always @(posedge clock)
begin: key_split_op
if(consent_reg == KEY_INIT)
begin
//Split the input key into the first 8 words
key_words[0] = key[255:224];
key_words[1] = key[223:192];
key_words[2] = key[191:160];
key_words[3] = key[159:128];
key_words[4] = key[127:96];
key_words[5] = key[95:64];
key_words[6] = key[63:32];
key_words[7] = key[31:0];
end
end
```

The `MATRIX_INIT` state is responsible for splitting the input data into a 4x4 byte matrix. I hardcoded the assignment of the data bytes to their respective positions in the state matrix.

```
//Initialize the state matrix with the input data
always @(posedge clock)
begin: init_matrix
if(consent_reg == MATRIX_INIT)
begin
matrix[3][3] = data_in[7:0];
matrix[2][3] = data_in[15:8];
matrix[1][3] = data_in[23:16];
//...
matrix[0][0] = data_in[127:120];
end
end
```
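The hardcoded assignments follow a column-major pattern: the most significant byte lands in `matrix[0][0]`, the next byte in `matrix[1][0]`, and so on down each column. A minimal Python sketch of that mapping, assuming the elided assignments follow the same pattern as the ones shown (the function name `state_from_block` is hypothetical):

```python
def state_from_block(block):
    """Split a 128-bit integer into a 4x4 byte matrix, column by column."""
    data = block.to_bytes(16, "big")  # data[0] is the most significant byte
    # Byte 4*col + row of the block becomes matrix[row][col].
    return [[data[4 * col + row] for col in range(4)] for row in range(4)]


state = state_from_block(0xE536638ECBCEC0BE6CE6A97E98DA827B)
print([hex(b) for b in state[0]])  # ['0xe5', '0xcb', '0x6c', '0x98']
```

The input value is the test vector used later in this post; the last byte, `0x7b`, ends up in `state[3][3]`, matching the `matrix[3][3] = data_in[7:0]` assignment.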

The `ADD_R0_KEY` state adds the first round key to the state matrix. This state and the `ADD_KEY` state are both carried out by the same `always` block. The two states perform the same action but are identified uniquely because they have different exit transitions. In the `ADD_R0_KEY` state, the variable `round_counter` is `0`, so the first (0th) round key is taken from the `round_keys` array. In GF(2^n) Galois fields, the XOR operation is equivalent to both addition and subtraction, so adding the round key reduces to a byte-by-byte XOR of the state matrix with the round key.

```
//Add the round key
always @(posedge clock)
begin: r0_key
if(consent_reg == ADD_R0_KEY || consent_reg == ADD_KEY)
begin
matrix[3][3] = matrix[3][3] ^ round_keys[round_counter][7:0];
matrix[2][3] = matrix[2][3] ^ round_keys[round_counter][15:8];
matrix[1][3] = matrix[1][3] ^ round_keys[round_counter][23:16];
matrix[0][3] = matrix[0][3] ^ round_keys[round_counter][31:24];
matrix[3][2] = matrix[3][2] ^ round_keys[round_counter][39:32];
matrix[2][2] = matrix[2][2] ^ round_keys[round_counter][47:40];
matrix[1][2] = matrix[1][2] ^ round_keys[round_counter][55:48];
matrix[0][2] = matrix[0][2] ^ round_keys[round_counter][63:56];
matrix[3][1] = matrix[3][1] ^ round_keys[round_counter][71:64];
matrix[2][1] = matrix[2][1] ^ round_keys[round_counter][79:72];
matrix[1][1] = matrix[1][1] ^ round_keys[round_counter][87:80];
matrix[0][1] = matrix[0][1] ^ round_keys[round_counter][95:88];
matrix[3][0] = matrix[3][0] ^ round_keys[round_counter][103:96];
matrix[2][0] = matrix[2][0] ^ round_keys[round_counter][111:104];
matrix[1][0] = matrix[1][0] ^ round_keys[round_counter][119:112];
matrix[0][0] = matrix[0][0] ^ round_keys[round_counter][127:120];
end
end
```
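Because addition in GF(2^n) is XOR, adding the same round key twice restores the original state. A minimal Python sketch of the same byte-by-byte XOR, using the column-major layout of the block above; the function name `add_round_key` and the key value are hypothetical placeholders, not output from the key schedule:

```python
def add_round_key(state, round_key):
    """XOR a 4x4 byte matrix with a 128-bit round key, byte by byte."""
    key = round_key.to_bytes(16, "big")
    # Key byte 4*col + row lines up with state[row][col].
    return [[state[row][col] ^ key[4 * col + row] for col in range(4)]
            for row in range(4)]


# Arbitrary example values, chosen only for illustration.
state = [[4 * col + row for col in range(4)] for row in range(4)]
round_key = 0x0F0E0D0C0B0A09080706050403020100

once = add_round_key(state, round_key)
twice = add_round_key(once, round_key)
print(twice == state)  # True: XOR is its own inverse
```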

The `SUB_BYTE` state performs a byte-by-byte substitution of the state matrix using the AES S-box. I added an `always` block to carry out this operation by iterating over each column and row, calling the `sbox` function on each byte. The `sbox` function and the S-box table are the same ones used for all other S-box operations in the algorithm. I covered the AES S-box in my previous post about the key scheduling portion of the algorithm.

```
//S-Box function
function [7:0] sbox;
input [7:0] b;
begin
sbox = s_box[b[7:4]][b[3:0]];
end
endfunction
//Sub bytes layer
always @(posedge clock)
begin: sub_bytes
integer row;
integer col;
if (consent_reg == SUB_BYTE)
begin
for(row = 0; row < 4; row++)
begin
for(col=0; col<4; col++)
begin
matrix[row][col] = sbox(matrix[row][col]);
end
end
end
end
```

The `SHIFT_ROW` state performs a row-by-row rotation of the state matrix. The first row remains as-is, the second row is rotated one byte to the left, the third row two bytes to the left, and the fourth row three bytes to the left. I created an `always` block to carry out the rotation in single-step increments. I repeated the left-shift step in a loop `row` times for each of rows 1-3.

```
//shift rows layer
always @(posedge clock)
begin: shift_row
integer row;
integer iter;
reg [7:0] temp_byte;
if (consent_reg == SHIFT_ROW)
begin
for(row = 1; row < 4; row++)
begin
for(iter = 0; iter < row; iter++)
begin
temp_byte = matrix[row][0];
matrix[row][0] = matrix[row][1];
matrix[row][1] = matrix[row][2];
matrix[row][2] = matrix[row][3];
matrix[row][3] = temp_byte;
end
end
end
end
```
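The repeated single-step rotation is equivalent to rotating row `r` left by `r` positions in one go. A minimal Python sketch showing both forms side by side (the function names are hypothetical):

```python
def shift_rows_stepwise(state):
    """Rotate row r left one byte at a time, r times, like the loop above."""
    out = [list(row) for row in state]
    for row in range(1, 4):
        for _ in range(row):
            out[row].append(out[row].pop(0))  # move the first byte to the end
    return out


def shift_rows(state):
    """Rotate row r left by r bytes in a single slicing step."""
    return [list(row[r:]) + list(row[:r]) for r, row in enumerate(state)]


state = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
print(shift_rows_stepwise(state))
# [[0, 1, 2, 3], [11, 12, 13, 10], [22, 23, 20, 21], [33, 30, 31, 32]]
print(shift_rows_stepwise(state) == shift_rows(state))  # True
```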

In the Mix Columns operation, each column of the state matrix is multiplied by a constant matrix defined in the AES specification. The `MIX_COL` state carries out this matrix multiplication. All of the multiplication and addition operations are performed in GF(2^8). Several elements within the constant matrix are `1`; multiplying by `1` leaves a byte unchanged, so those terms are XORed in directly without a call to the multiplication function. The `MIX_COL` `always` block loops over each column performing the matrix multiplication.

```
//Mix columns layer
always @(posedge clock)
begin: mix_col
integer col;
reg [7:0] temp_col[0:3];
if(consent_reg == MIX_COL)
begin
for(col = 0; col < 4; col++)
begin
temp_col[0] = gf2mult(2, matrix[0][col]) ^ gf2mult(3, matrix[1][col]) ^ matrix[2][col] ^ matrix[3][col];
temp_col[1] = matrix[0][col] ^ gf2mult(2, matrix[1][col]) ^ gf2mult(3, matrix[2][col]) ^ matrix[3][col];
temp_col[2] = matrix[0][col] ^ matrix[1][col] ^ gf2mult(2, matrix[2][col]) ^ gf2mult(3, matrix[3][col]);
temp_col[3] = gf2mult(3, matrix[0][col]) ^ matrix[1][col] ^ matrix[2][col] ^ gf2mult(2, matrix[3][col]);
matrix[0][col] = temp_col[0];
matrix[1][col] = temp_col[1];
matrix[2][col] = temp_col[2];
matrix[3][col] = temp_col[3];
end
end
end
```
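Since the constant matrix only contains 1, 2, and 3, the column multiplication can also be written with a single doubling helper: multiplying by 3 is doubling followed by an XOR with the original byte. A minimal Python sketch of one column, checked against a widely published MixColumns test column; the helper names `xtime` and `mix_column` are assumptions and this is not a translation of the Verilog block:

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def mix_column(col):
    """Multiply one 4-byte column by the AES MixColumns constant matrix."""
    a0, a1, a2, a3 = col

    def mul3(b):
        return xtime(b) ^ b  # 3*b = 2*b + b

    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```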

The function `gf2mult` is a helper function that performs multiplication in GF(2^8). The function uses the shift-and-add technique for multiplication. Before each left shift, the highest bit of the `x` operand is checked. If this bit is set, the shift produces a polynomial of degree 8, the same degree as the irreducible polynomial, `P`, that defines the AES finite field, so the result must be reduced by subtracting (XORing) away `P`. `P` is represented in hex as `0x11B`. Because this multiplication is a single-byte operation and the shift has already discarded the ninth bit, `P` is truncated to the lower 8 bits, or `0x1B`. I described this operation and some of the mathematical basis of it in more detail in my previous post.

```
//Multiplication in GF
function [7:0] gf2mult;
input reg [7:0] x;
input reg [7:0] y;
integer i;
reg [7:0] b;
begin
gf2mult = 0;
for(i = 0; i < 8; i++)
begin
if(y & 1'b1)
begin
gf2mult = gf2mult ^ x;
end
b = (x & 8'h80);
x = (x << 1);
if(b)
x = x ^ 8'h1B;
y = (y >> 1);
end
end
endfunction
```
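For comparison, the same shift-and-add loop can be written as a line-for-line Python sketch. It is meant only as a software cross-check of the Verilog function; the check values are the multiplication example from the AES specification (`0x57` times `0x83`).

```python
def gf2mult(x, y):
    """Multiply two bytes in GF(2^8) with the AES polynomial (shift-and-add)."""
    product = 0
    for _ in range(8):
        if y & 1:                # lowest bit of y set: add (XOR) the current x
            product ^= x
        carry = x & 0x80         # remember the bit about to be shifted out
        x = (x << 1) & 0xFF      # multiply x by 2, keeping 8 bits
        if carry:
            x ^= 0x1B            # reduce by P (0x11B) with the ninth bit dropped
        y >>= 1
    return product


print(hex(gf2mult(0x57, 0x83)))  # 0xc1
print(hex(gf2mult(0x57, 0x02)))  # 0xae
```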

The `ADD_KEY` state re-uses the same block as the `ADD_R0_KEY` state.

The `OUT_CONS` state is responsible for reassembling the 4x4 state matrix into the 128-bit output register. This is performed by the final `always` block in the encryption process.

```
//Construct the output data from the state matrix
always @(posedge clock)
begin : outdata
integer c;
integer r;
reg [7:0] b;
if(consent_reg == OUT_CONS)
begin
for(c = 0; c < 4; c++)
begin
for(r = 0; r < 4; r++)
begin
data_out = data_out << 8;
b = matrix[r][c];
data_out |= b;
end
end
end
end
```
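The shift-and-OR loop walks the matrix column by column, so the first byte shifted in (`matrix[0][0]`) ends up as the most significant byte of the output. A minimal Python sketch of the same loop (the function name `block_from_state` is hypothetical). Python integers are unbounded, so the sketch starts from zero instead of relying on old bits falling off the top of a 128-bit register:

```python
def block_from_state(state):
    """Reassemble a 4x4 byte matrix into a 128-bit integer, column by column."""
    data_out = 0
    for col in range(4):
        for row in range(4):
            data_out = (data_out << 8) | state[row][col]
    return data_out


# Fill the matrix so that state[row][col] holds its own byte position.
state = [[4 * col + row for col in range(4)] for row in range(4)]
print(f"{block_from_state(state):032x}")  # 000102030405060708090a0b0c0d0e0f
```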

When the encryption process is complete, the consent state machine transitions into the `OP_DONE` state. If the consent register is in this state, the `valid` output bit is set to indicate that the output data is valid.

```
//Set valid signal if the operation is complete
assign valid = (consent_reg == OP_DONE);
```

The state machine is reset if the key is changed, input data is changed, or the module receives a reset signal. The `always` block that performs the reset sets the consent register to the initial state and restarts the round counter.

```
//Reset if the input parameters change
always @(key, data_in, reset)
begin : reset_op
consent_reg = 0;
round_counter = 0;
end
```

I created a test bench to drive the AES module. First, I added registers and wires for the module inputs and outputs.

```
module test_bench;
//Clock signal
reg clock = 0;
//Input key
reg [255:0] key;
//Input data
reg [127:0] data_in;
//Enable signal
reg enable;
//Reset signal
reg reset;
//Output data
wire [127:0] data_out;
//Valid signal
wire valid;
//Valid register
reg valid_reg;
```

Next I declared an instance of the AES module, making the appropriate connections.

```
//AES block under test
AES256 uut (
.clock(clock),
.enable(enable),
.reset(reset),
.key(key),
.data_in(data_in),
.data_out(data_out),
.valid(valid)
);
```

I set the key, set the input data, and enabled the module.

```
initial
begin : test
valid_reg = 0;
//Set the key
key = 256'h97247d91d32fa1f6bece5da9bfe61c1a3b32edf26fd6ec2a6187ba777fc3c1d8;
$display("Input key: \n %x \n", key);
//Set the input data
data_in = 128'he536638ecbcec0be6ce6a97e98da827b;
$display("Input data: \n %x \n", data_in);
//Set enabled
enable = 1;
```

I triggered the clock in a loop for a maximum of 100 clock cycles. The loop exits early if the validity bit is set indicating that the encryption process has completed. A message is printed if the process times out. The resulting output data, number of clock cycles and the state of the validity bit are printed.

```
$display("Starting encryption.");
//Trigger the clock until the valid bit is set or it times out
for(i = 0; (i < 100) && (valid_reg != 1); i++)
begin
#5 clock = 1;
#5 clock = 0;
end
if(i == 100)
$display("Timed out!");
//Verify the valid bit is set
$display("Clock cycles: %d", i);
$display("Data out: %x", data_out);
$display("Valid: %b", valid_reg);
```

Next, I changed the key to exercise the key change reset functionality, triggered the clock, and verified that the validity bit was no longer set.

```
//Change the key
key = key + 1;
$display("\nSetting a new key: \n %x \n", key);
//Trigger the clock
#5 clock = 1;
#5 clock = 0;
//Verify the validity bit is cleared
$display("Valid after new key: %d", valid_reg);
```

After repeating the encryption/printing cycle described above, I test the `reset` signal functionality by setting the reset bit and verifying that the validity bit is no longer set.

```
//Verify reset works
$display("\nChecking reset.");
reset = 1;
#5 clock = 1;
#5 clock = 0;
reset = 0;
#5 clock = 1;
#5 clock = 0;
$display("Valid after reset: %d", valid_reg);
```

As before, I repeated the encryption and output printing cycle. Lastly, I verified that a change in the input data results in a reset of the AES module before encrypting and printing the output a final time.

```
//Verify reset on input data change
$display("\nChecking reset on data change.");
data_in = data_in + 1;
$display("Setting new input data: %x", data_in);
#5 clock = 1;
#5 clock = 0;
$display("Valid after data change: %d", valid_reg);
```

Executing the test bench results in the following output:

```
Input key:
97247d91d32fa1f6bece5da9bfe61c1a3b32edf26fd6ec2a6187ba777fc3c1d8
Input data:
e536638ecbcec0be6ce6a97e98da827b
Starting encryption.
Clock cycles: 63
Data out: 6034088a2dedde69013d073e8681d21c
Valid: 1
Setting a new key:
97247d91d32fa1f6bece5da9bfe61c1a3b32edf26fd6ec2a6187ba777fc3c1d9
Valid after new key: 0
Starting encryption.
Clock cycles: 62
Data out: 1009ddfd3e0fff0ee3d1f63944fc14f4
Valid: 1
Checking reset.
Valid after reset: 0
Starting encryption.
Clock cycles: 62
Data out: 1009ddfd3e0fff0ee3d1f63944fc14f4
Valid: 1
Checking reset on data change.
Setting new input data: e536638ecbcec0be6ce6a97e98da827c
Valid after data change: 0
Starting encryption.
Clock cycles: 62
Data out: 549d22264b0dabc2a9b8f15d27310a80
Valid: 1
```

To verify the output of the test bench, I used my Python implementation of AES covered in my last post (itself verified against Python's PyCrypto library). Using the same initial key and input data, I repeated the same set of cryptographic operations.

```
#The input key
input_key = 0x97247d91d32fa1f6bece5da9bfe61c1a3b32edf26fd6ec2a6187ba777fc3c1d8
#One block of input data
input_data = 0xe536638ecbcec0be6ce6a97e98da827b
print("First key")
crypt(input_key, input_data)
print("Key 2")
input_key = input_key + 1
crypt(input_key, input_data)
print("Data 2")
input_data = input_data + 1
crypt(input_key, input_data)
```

The output of this script matches the output of my test bench.

```
First key
Input key:
0x97247d91d32fa1f6bece5da9bfe61c1a3b32edf26fd6ec2a6187ba777fc3c1d8
Input data:
0xe536638ecbcec0be6ce6a97e98da827b
Cipher text:
0x6034088a2dedde69013d073e8681d21c
Key 2
Input key:
0x97247d91d32fa1f6bece5da9bfe61c1a3b32edf26fd6ec2a6187ba777fc3c1d9
Input data:
0xe536638ecbcec0be6ce6a97e98da827b
Cipher text:
0x1009ddfd3e0fff0ee3d1f63944fc14f4
Data 2
Input key:
0x97247d91d32fa1f6bece5da9bfe61c1a3b32edf26fd6ec2a6187ba777fc3c1d9
Input data:
0xe536638ecbcec0be6ce6a97e98da827c
Cipher text:
0x549d22264b0dabc2a9b8f15d27310a80
```

While I'm certain there are optimizations/improvements and potentially a complete refactor that might improve my implementation, I'm still happy with the result. In future posts, I am going to continue refining this design and continue working towards a useful, software-accessible AES peripheral.