I began implementing the Advanced Encryption Standard (AES) encryption algorithm in Verilog with the intention of creating a configurable digital logic (CDL) encryption engine. This post, the first in a series, covers the key expansion (key scheduling) portion of the implementation. I've decided to cover the algorithm itself in more detail than I did in my DES implementation.

Disclaimer: I am a complete Verilog novice. I'm certain there are bad design/bad practice/bugs in this implementation. Eventually I intend for someone knowledgeable to review and critique my work in an effort to improve.

## AES 256 Key Schedule in Python

Like my DES implementation, I decided to implement the AES algorithm in Python before writing it in Verilog. This was extremely useful for design/debugging. I will cover each part of the key generation process, the S-box, `H` function, `G` function, and the main key generation algorithm. First, I created a look up table for the substition and inverse substition boxes. The AES S-box is a short cut around calculating the substitions.

The first step in the substition is to calculate the multiplicative inverse of the input byte in the Finite Field GF(2^8), i.e. `B' = A^-1` such that `A * B' is congruent to 1 mod P` where `P` is the irreducible polynomial that defines the AES field, `x^8 + x^4 + x^3 + x + 1`. That polynomial is defined in the AES specification and is used to derive all of the elements in GF(2^8). `B'` is then multiplied by a constant matrix, then addd to another constant matrix in an operation called Affine mapping. This calculation is done modulo 2 and the resulting byte is the substition value. Because there exists no multiplicative inverse of `0`, `0` is treated as a special case and mapped to itself to facilitate inversion.

Because there only exist 256 possible inputs to the substition calculation, it is easy enough to map each input to it's output, resulting in the S-box table. I created both the normal and inverse S-boxes using multi-dimensional arrays.

``````#AES S-box
s_box = [
[0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76],
[0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0],
[0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15],
[0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75],
[0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84],
[0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf],
[0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8],
[0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2],
[0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73],
[0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb],
[0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79],
[0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08],
[0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a],
[0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e],
[0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf],
[0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16]
]

#AES Inverse S-box
inv_s_box = [
[0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38, 0xbf, 0x40, 0xa3, 0x9e, 0x81, 0xf3, 0xd7, 0xfb],
[0x7c, 0xe3, 0x39, 0x82, 0x9b, 0x2f, 0xff, 0x87, 0x34, 0x8e, 0x43, 0x44, 0xc4, 0xde, 0xe9, 0xcb],
[0x54, 0x7b, 0x94, 0x32, 0xa6, 0xc2, 0x23, 0x3d, 0xee, 0x4c, 0x95, 0x0b, 0x42, 0xfa, 0xc3, 0x4e],
[0x08, 0x2e, 0xa1, 0x66, 0x28, 0xd9, 0x24, 0xb2, 0x76, 0x5b, 0xa2, 0x49, 0x6d, 0x8b, 0xd1, 0x25],
[0x72, 0xf8, 0xf6, 0x64, 0x86, 0x68, 0x98, 0x16, 0xd4, 0xa4, 0x5c, 0xcc, 0x5d, 0x65, 0xb6, 0x92],
[0x6c, 0x70, 0x48, 0x50, 0xfd, 0xed, 0xb9, 0xda, 0x5e, 0x15, 0x46, 0x57, 0xa7, 0x8d, 0x9d, 0x84],
[0x90, 0xd8, 0xab, 0x00, 0x8c, 0xbc, 0xd3, 0x0a, 0xf7, 0xe4, 0x58, 0x05, 0xb8, 0xb3, 0x45, 0x06],
[0xd0, 0x2c, 0x1e, 0x8f, 0xca, 0x3f, 0x0f, 0x02, 0xc1, 0xaf, 0xbd, 0x03, 0x01, 0x13, 0x8a, 0x6b],
[0x3a, 0x91, 0x11, 0x41, 0x4f, 0x67, 0xdc, 0xea, 0x97, 0xf2, 0xcf, 0xce, 0xf0, 0xb4, 0xe6, 0x73],
[0x96, 0xac, 0x74, 0x22, 0xe7, 0xad, 0x35, 0x85, 0xe2, 0xf9, 0x37, 0xe8, 0x1c, 0x75, 0xdf, 0x6e],
[0x47, 0xf1, 0x1a, 0x71, 0x1d, 0x29, 0xc5, 0x89, 0x6f, 0xb7, 0x62, 0x0e, 0xaa, 0x18, 0xbe, 0x1b],
[0xfc, 0x56, 0x3e, 0x4b, 0xc6, 0xd2, 0x79, 0x20, 0x9a, 0xdb, 0xc0, 0xfe, 0x78, 0xcd, 0x5a, 0xf4],
[0x1f, 0xdd, 0xa8, 0x33, 0x88, 0x07, 0xc7, 0x31, 0xb1, 0x12, 0x10, 0x59, 0x27, 0x80, 0xec, 0x5f],
[0x60, 0x51, 0x7f, 0xa9, 0x19, 0xb5, 0x4a, 0x0d, 0x2d, 0xe5, 0x7a, 0x9f, 0x93, 0xc9, 0x9c, 0xef],
[0xa0, 0xe0, 0x3b, 0x4d, 0xae, 0x2a, 0xf5, 0xb0, 0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61],
[0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26, 0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d]
]
``````

Nex I pre-calculated values known as the Round Constants, the use of which is covered below. The round constants represent members of the field GF(2^8). The polynomials in GF(2^8) can be represented in binary form by assigning the coefficients to bits. I.E:

``````|MSB | B6 | B5  | B4  | B3  | B2  | B1  |LSB  |
|----|----|-----|-----|-----|-----|-----|-----|
|X^7 | X^6| x^5 | X^4 | X^3 | X^2 | X^1 | x^0 |
``````

For example, the polynomial `x^5 + x^4 + x^2 + x` can be represented in binary as `00110110`. The round coefficients are calculated by doubling the previous constant, starting with `C0 = 1`. This is equivalent to multiplying each member by `x`, keeping in mind all arithmitec is done in GF(2^8) using `P`. The following table shows the constants in algebraic and hexidecimal representation.

``````| Round Constant     | Algebraic           | Hexadecimal |
|--------------------|---------------------|-------------|
| C1                 | 1                   | 0x1         |
| C2                 | X                   | 0x2         |
| C3                 | X^2                 | 0x4         |
| C4                 | X^3                 | 0x8         |
| C5                 | X^4                 | 0x10        |
| C6                 | X^5                 | 0x20        |
| C7                 | X^6                 | 0x40        |
| C8                 | X^7                 | 0x80        |
| C9                 | X^4 + X^3 + X + 1   | 0x1B        |
| C10                | X^5 + X^4 + X^2 + X | 0x36        |
``````

Note that the step of doubling `C8` would result in a polynomial of a higher degree than the `P` polynomial and must be reduced modulo `P`. Addition and subtraction of coefficients are done in the prime field GF(2); i.e. calculations are done modulo 2. For example, `12 + 20` in GF2:

``````12 = 2^3 + 2^2
20 = 2^4 + 2^2
= 2^4 + 2^3 + 2*2^3
= 2^4 + 2^3 + 0  //2*2 = 4, 4 % 2 = 0
= 24
``````

The binary representation of these polynomials brings up the important fact that addition and subtraction operations in GF(2) are equivalent to the XOR operation. Note that these binary representations are not of the integer value, it's the encoding of the polynomial.

``````12  = 01100
20  = 10100
XOR = 11000
= 2^4 + 2^3
= 24
``````

As a result, reducing the higher-degree polynomial round constants can be done via XOR with `P`. For example (recall `C8 = X^7`):

``````C9 = 2^7 * 2
C10'  = 2^8 = 100000000

P = x^5 + x^4 + x^2 + x = 000100011011

C10 = C10' ^ P
= 100000000 ^ 000100011011
= 11011
= 0x1B

``````

Because there are only 10 round constants needed in the key schedule for AES 256, it makes sense to pre-calculate them and store them in a table. However, I chose to actually perform the calculation in my python script to illustrate the operations described above.

``````# Calculate the Round Constants
#0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36
r_con = [None] * 10
r_con[0] = 1
for x in range(1,10):
if(r_con[x-1] >= 0x80):
#If the result > 0x80, reduce by the AES irreducible polynomial
r_con[x] = (2 * r_con[x-1] ) ^ 0x11B
else:
#otherwise just double it as normal
r_con[x] = 2 * r_con[x-1]
``````

Next I added two routines for the S-box and inverse S-box operations. The former is required for the key schedule.

``````#S-box substitution
def sbox(bIn):
col = bIn & 0xF
row = (bIn >> 4) & 0xF
return s_box[row][col]

#Inverse S-box substitution
def invsbox(bIn):
col = bIn & 0xF
row = (bIn >> 4) & 0xF
return inv_s_box[row][col]
``````

The first step in the key schedule is to split the user provided key into 32-bit words. Fifteen 128-bit round keys equates to 60 32-bit words total. The key words are ordered from left to right, i.e. the most significant bytes of the key become `W0`, the least significant bytes become `W7`. The argument to the following routine is expected to be the 60-word array. This function is used in the key generation routine covered later.

``````#Split the 256-bit key into 32-bit words
def splitkey(inkey, words):
shift = 224
for x in range(8):
words[x] = (inkey >> shift) & 0xFFFFFFFF
shift -= 32
return words
``````

Next, I added a routine for a byte-wise left rotation of a word. The routine shifts every byte to the left, except the left-most bit that wraps to the right-hand side.

``````#Helper function to byte-wise left rotate a 32-bit word
def l_rotate_word(word):
word = ((word << 8) & 0xFFFFFF00)  | ((word >> 24) & 0xFF)
return word
``````

The next function, called `G`, removes linearity and symmetry from the AES key schedule. The word is first left-rotated using the function above. Next, each byte is substituted using the S-box substitition. The left-most byte is XOR'd with the round constant for the given round.

``````#The AES G function
def g(wIn, rc):
wIn = l_rotate_word(wIn)
w0 = wIn >> 24 & 0xFF
w1 = wIn >> 16 & 0xFF
w2 = wIn >> 8 &  0xFF
w3 = wIn & 0xFF
w0 = sbox(w0) ^ rc
w1 = sbox(w1)
w2 = sbox(w2)
w3 = sbox(w3)
ret = w0 << 24 | w1 << 16 | w2 << 8 | w3
return ret
``````

Next, I created the `H` function. `H` performs a simple byte-by-byte substitution on a word using the same S-box substitution.

``````#The AES 256 H-function
def h(wIn):
w0 = sbox(wIn >> 24 & 0xFF)
w1 = sbox(wIn >> 16 & 0xFF)
w2 = sbox(wIn >> 8 &  0xFF)
w3 = sbox(wIn & 0xFF)
ret = w0 << 24 | w1 << 16 | w2 << 8 | w3
return ret
``````

The main key generation function makes use of the previous two functions when generating the remaining words. `W8` - `W15` are derived from `W0` - `W7` and so on. The first word in the next group is obtained by XOR'ing the first word of the previous group with the result of `G` of the seventh word of the previous group.

``````W8 = W0 ^ G(W7, round_constant[0])

``````

The fifth word of the word in the group is obtained by XOR'ing the fifth word of the previous group with the result of `H` of the fourth word of the current group.

``````W12 = H(W11) ^ W4

``````

The other words of the current group are obtained by XORing the Nth word in the previous group with the N-1 word in the previous group.

``````W10 = W2 ^ W1

``````

This pattern repeats for every group of words. Lastly, the words are concatenated to form the round keys. The routine below performs these operations.

``````#Expand the 256-bit key into the 15 round keys
def key_expansion(inkey):
#Split the initial key into words 0-7
words = [None] * 60
words = splitkey(inkey, words)

#Start calculating the remaining words
rconIdx = 0
for x in range(8,60):
if x % 8 == 0:
#every 8th word uses the G function
words[x] = g(words[x-1], r_con[rconIdx]) ^ words[x-8]
rconIdx = rconIdx + 1
elif x % 4 == 0:
#Every other fourth word uses the H function
words[x] = h(words[x-1]) ^ words[x-8]
else:
#Otherwise use a simple XOR
words[x] = words[x-1] ^ words[x-8]

#every 4 words forms a subkey
keyIdx = 0
keys = [None] * 15
for x in range(61):
if x != 0 and (x % 4) == 0:
keys[keyIdx] = (words[x-4] << 96) | (words[x-3] << 64) | (words[x-2] << 32) | words[x-1]
keyIdx = keyIdx + 1

return keys
``````

Running the python script results in the following round keys (given the input key below).

``````Input key:
68363713280131350524037942004600087937288646492582532878035616528111639970264

Round keys:
01 - 0x97247d91d32fa1f6bece5da9bfe61c1a
02 - 0x3b32edf26fd6ec2a6187ba777fc3c1d8
03 - 0xb85c1c436b73bdb5d5bde01c6a5bfc06
04 - 0x390b5d9d56ddb1b7375a0bc04899ca18
05 - 0x5428b1113f5b0ca4eae6ecb880bd10be
06 - 0xf4719733a2ac268495f62d44dd6fe75c
08 - 0x6114bc73c3b89af7564eb7b38b2150ef
09 - 0xdef24edca08d399e709c8554ab5c327
10 - 0xb7c192bf747908482237bffba916ef14
11 - 0x5a30de3e90380da77731c5f23d8406d5
12 - 0x909efdbce4e7f5f4c6d04a0f6fc6a51b
13 - 0xce3671965e0e7c31293fb9c314bbbf16
14 - 0x6a74f5fb8e93000f48434a002785ef1b
15 - 0x19e9de5a47e7a26b6ed81ba87a63a4be

``````

## Summary

This python script will provide me with a roadmap to my Verilog implementation and facilitate debugging. In the next post, I will cover the Verilog implementation and testing. In later posts, I'll continue developing the full AES256 algorithm.