AES in CDL Part 1 - Key Schedule in Python

I began implementing the Advanced Encryption Standard (AES) encryption algorithm in Verilog with the intention of creating a configurable digital logic (CDL) encryption engine. This post, the first in a series, covers the key expansion (key scheduling) portion of the implementation. I've decided to cover the algorithm itself in more detail than I did in my DES implementation.

Disclaimer: I am a complete Verilog novice. I'm certain there are bad design/bad practice/bugs in this implementation. Eventually I intend for someone knowledgeable to review and critique my work in an effort to improve.

AES 256 Key Schedule in Python

Like my DES implementation, I decided to implement the AES algorithm in Python before writing it in Verilog. This was extremely useful for design/debugging. I will cover each part of the key generation process, the S-box, H function, G function, and the main key generation algorithm. First, I created a look up table for the substition and inverse substition boxes. The AES S-box is a short cut around calculating the substitions.

The first step in the substition is to calculate the multiplicative inverse of the input byte in the Finite Field GF(2^8), i.e. B' = A^-1 such that A * B' is congruent to 1 mod P where P is the irreducible polynomial that defines the AES field, x^8 + x^4 + x^3 + x + 1. That polynomial is defined in the AES specification and is used to derive all of the elements in GF(2^8). B' is then multiplied by a constant matrix, then addd to another constant matrix in an operation called Affine mapping. This calculation is done modulo 2 and the resulting byte is the substition value. Because there exists no multiplicative inverse of 0, 0 is treated as a special case and mapped to itself to facilitate inversion.

Because there only exist 256 possible inputs to the substition calculation, it is easy enough to map each input to it's output, resulting in the S-box table. I created both the normal and inverse S-boxes using multi-dimensional arrays.

#AES S-box
s_box = [
    [0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76],
    [0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0],
    [0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15],
    [0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75],
    [0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84],
    [0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf],
    [0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8],
    [0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2],
    [0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73],
    [0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb],
    [0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79],
    [0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08],
    [0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a],
    [0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e],
    [0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf],
    [0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16]
]

#AES Inverse S-box
inv_s_box = [
    [0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38, 0xbf, 0x40, 0xa3, 0x9e, 0x81, 0xf3, 0xd7, 0xfb],
    [0x7c, 0xe3, 0x39, 0x82, 0x9b, 0x2f, 0xff, 0x87, 0x34, 0x8e, 0x43, 0x44, 0xc4, 0xde, 0xe9, 0xcb],
    [0x54, 0x7b, 0x94, 0x32, 0xa6, 0xc2, 0x23, 0x3d, 0xee, 0x4c, 0x95, 0x0b, 0x42, 0xfa, 0xc3, 0x4e],
    [0x08, 0x2e, 0xa1, 0x66, 0x28, 0xd9, 0x24, 0xb2, 0x76, 0x5b, 0xa2, 0x49, 0x6d, 0x8b, 0xd1, 0x25],
    [0x72, 0xf8, 0xf6, 0x64, 0x86, 0x68, 0x98, 0x16, 0xd4, 0xa4, 0x5c, 0xcc, 0x5d, 0x65, 0xb6, 0x92],
    [0x6c, 0x70, 0x48, 0x50, 0xfd, 0xed, 0xb9, 0xda, 0x5e, 0x15, 0x46, 0x57, 0xa7, 0x8d, 0x9d, 0x84],
    [0x90, 0xd8, 0xab, 0x00, 0x8c, 0xbc, 0xd3, 0x0a, 0xf7, 0xe4, 0x58, 0x05, 0xb8, 0xb3, 0x45, 0x06],
    [0xd0, 0x2c, 0x1e, 0x8f, 0xca, 0x3f, 0x0f, 0x02, 0xc1, 0xaf, 0xbd, 0x03, 0x01, 0x13, 0x8a, 0x6b],
    [0x3a, 0x91, 0x11, 0x41, 0x4f, 0x67, 0xdc, 0xea, 0x97, 0xf2, 0xcf, 0xce, 0xf0, 0xb4, 0xe6, 0x73],
    [0x96, 0xac, 0x74, 0x22, 0xe7, 0xad, 0x35, 0x85, 0xe2, 0xf9, 0x37, 0xe8, 0x1c, 0x75, 0xdf, 0x6e],
    [0x47, 0xf1, 0x1a, 0x71, 0x1d, 0x29, 0xc5, 0x89, 0x6f, 0xb7, 0x62, 0x0e, 0xaa, 0x18, 0xbe, 0x1b],
    [0xfc, 0x56, 0x3e, 0x4b, 0xc6, 0xd2, 0x79, 0x20, 0x9a, 0xdb, 0xc0, 0xfe, 0x78, 0xcd, 0x5a, 0xf4],
    [0x1f, 0xdd, 0xa8, 0x33, 0x88, 0x07, 0xc7, 0x31, 0xb1, 0x12, 0x10, 0x59, 0x27, 0x80, 0xec, 0x5f],
    [0x60, 0x51, 0x7f, 0xa9, 0x19, 0xb5, 0x4a, 0x0d, 0x2d, 0xe5, 0x7a, 0x9f, 0x93, 0xc9, 0x9c, 0xef],
    [0xa0, 0xe0, 0x3b, 0x4d, 0xae, 0x2a, 0xf5, 0xb0, 0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61],
    [0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26, 0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d]
]

Nex I pre-calculated values known as the Round Constants, the use of which is covered below. The round constants represent members of the field GF(2^8). The polynomials in GF(2^8) can be represented in binary form by assigning the coefficients to bits. I.E:

|MSB | B6 | B5  | B4  | B3  | B2  | B1  |LSB  |
|----|----|-----|-----|-----|-----|-----|-----|
|X^7 | X^6| x^5 | X^4 | X^3 | X^2 | X^1 | x^0 |

For example, the polynomial x^5 + x^4 + x^2 + x can be represented in binary as 00110110. The round coefficients are calculated by doubling the previous constant, starting with C0 = 1. This is equivalent to multiplying each member by x, keeping in mind all arithmitec is done in GF(2^8) using P. The following table shows the constants in algebraic and hexidecimal representation.

| Round Constant     | Algebraic           | Hexadecimal |
|--------------------|---------------------|-------------|
| C1                 | 1                   | 0x1         |
| C2                 | X                   | 0x2         |
| C3                 | X^2                 | 0x4         |
| C4                 | X^3                 | 0x8         |
| C5                 | X^4                 | 0x10        |
| C6                 | X^5                 | 0x20        |
| C7                 | X^6                 | 0x40        |
| C8                 | X^7                 | 0x80        |
| C9                 | X^4 + X^3 + X + 1   | 0x1B        |
| C10                | X^5 + X^4 + X^2 + X | 0x36        |

Note that the step of doubling C8 would result in a polynomial of a higher degree than the P polynomial and must be reduced modulo P. Addition and subtraction of coefficients are done in the prime field GF(2); i.e. calculations are done modulo 2. For example, 12 + 20 in GF2:

12 = 2^3 + 2^2
20 = 2^4 + 2^2
   = 2^4 + 2^3 + 2*2^3 
   = 2^4 + 2^3 + 0  //2*2 = 4, 4 % 2 = 0 
   = 24

The binary representation of these polynomials brings up the important fact that addition and subtraction operations in GF(2) are equivalent to the XOR operation. Note that these binary representations are not of the integer value, it's the encoding of the polynomial.

12  = 01100
20  = 10100
XOR = 11000
    = 2^4 + 2^3
    = 24

As a result, reducing the higher-degree polynomial round constants can be done via XOR with P. For example (recall C8 = X^7):

C9 = 2^7 * 2
C10'  = 2^8 = 100000000

P = x^5 + x^4 + x^2 + x = 000100011011

C10 = C10' ^ P
    = 100000000 ^ 000100011011
    = 11011
    = 0x1B

Because there are only 10 round constants needed in the key schedule for AES 256, it makes sense to pre-calculate them and store them in a table. However, I chose to actually perform the calculation in my python script to illustrate the operations described above.

# Calculate the Round Constants
#0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36
r_con = [None] * 10
r_con[0] = 1
for x in range(1,10):
    if(r_con[x-1] >= 0x80):
        #If the result > 0x80, reduce by the AES irreducible polynomial
        r_con[x] = (2 * r_con[x-1] ) ^ 0x11B
    else:
        #otherwise just double it as normal
        r_con[x] = 2 * r_con[x-1]

Next I added two routines for the S-box and inverse S-box operations. The former is required for the key schedule.

#S-box substitution
def sbox(bIn):
    col = bIn & 0xF
    row = (bIn >> 4) & 0xF
    return s_box[row][col]

#Inverse S-box substitution
def invsbox(bIn):
    col = bIn & 0xF
    row = (bIn >> 4) & 0xF
    return inv_s_box[row][col]

The first step in the key schedule is to split the user provided key into 32-bit words. Fifteen 128-bit round keys equates to 60 32-bit words total. The key words are ordered from left to right, i.e. the most significant bytes of the key become W0, the least significant bytes become W7. The argument to the following routine is expected to be the 60-word array. This function is used in the key generation routine covered later.

#Split the 256-bit key into 32-bit words
def splitkey(inkey, words):
    shift = 224
    for x in range(8):
        words[x] = (inkey >> shift) & 0xFFFFFFFF
        shift -= 32
    return words

Next, I added a routine for a byte-wise left rotation of a word. The routine shifts every byte to the left, except the left-most bit that wraps to the right-hand side.

#Helper function to byte-wise left rotate a 32-bit word
def l_rotate_word(word):
    word = ((word << 8) & 0xFFFFFF00)  | ((word >> 24) & 0xFF)
    return word

The next function, called G, removes linearity and symmetry from the AES key schedule. The word is first left-rotated using the function above. Next, each byte is substituted using the S-box substitition. The left-most byte is XOR'd with the round constant for the given round.

#The AES G function
def g(wIn, rc):
    wIn = l_rotate_word(wIn)
    w0 = wIn >> 24 & 0xFF
    w1 = wIn >> 16 & 0xFF
    w2 = wIn >> 8 &  0xFF
    w3 = wIn & 0xFF
    w0 = sbox(w0) ^ rc
    w1 = sbox(w1)
    w2 = sbox(w2)
    w3 = sbox(w3)
    ret = w0 << 24 | w1 << 16 | w2 << 8 | w3
    return ret

Next, I created the H function. H performs a simple byte-by-byte substitution on a word using the same S-box substitution.

#The AES 256 H-function
def h(wIn):
    w0 = sbox(wIn >> 24 & 0xFF)
    w1 = sbox(wIn >> 16 & 0xFF)
    w2 = sbox(wIn >> 8 &  0xFF)
    w3 = sbox(wIn & 0xFF)
    ret = w0 << 24 | w1 << 16 | w2 << 8 | w3
    return ret

The main key generation function makes use of the previous two functions when generating the remaining words. W8 - W15 are derived from W0 - W7 and so on. The first word in the next group is obtained by XOR'ing the first word of the previous group with the result of G of the seventh word of the previous group.

W8 = W0 ^ G(W7, round_constant[0])

The fifth word of the word in the group is obtained by XOR'ing the fifth word of the previous group with the result of H of the fourth word of the current group.

W12 = H(W11) ^ W4

The other words of the current group are obtained by XORing the Nth word in the previous group with the N-1 word in the previous group.

W10 = W2 ^ W1

This pattern repeats for every group of words. Lastly, the words are concatenated to form the round keys. The routine below performs these operations.

#Expand the 256-bit key into the 15 round keys
def key_expansion(inkey):
    #Split the initial key into words 0-7
    words = [None] * 60
    words = splitkey(inkey, words)
    
    #Start calculating the remaining words
    rconIdx = 0
    for x in range(8,60):
        if x % 8 == 0:
            #every 8th word uses the G function
            words[x] = g(words[x-1], r_con[rconIdx]) ^ words[x-8]
            rconIdx = rconIdx + 1
        elif x % 4 == 0:
            #Every other fourth word uses the H function
            words[x] = h(words[x-1]) ^ words[x-8]
        else:
            #Otherwise use a simple XOR
            words[x] = words[x-1] ^ words[x-8]

    #every 4 words forms a subkey
    keyIdx = 0
    keys = [None] * 15
    for x in range(61):
        if x != 0 and (x % 4) == 0:
            keys[keyIdx] = (words[x-4] << 96) | (words[x-3] << 64) | (words[x-2] << 32) | words[x-1]
            keyIdx = keyIdx + 1

    return keys

Running the python script results in the following round keys (given the input key below).

Input key:
68363713280131350524037942004600087937288646492582532878035616528111639970264

Round keys:
01 - 0x97247d91d32fa1f6bece5da9bfe61c1a
02 - 0x3b32edf26fd6ec2a6187ba777fc3c1d8
03 - 0xb85c1c436b73bdb5d5bde01c6a5bfc06
04 - 0x390b5d9d56ddb1b7375a0bc04899ca18
05 - 0x5428b1113f5b0ca4eae6ecb880bd10be
06 - 0xf4719733a2ac268495f62d44dd6fe75c
07 - 0xf8bcfbd0c7e7f7742d011bccadbc0b72
08 - 0x6114bc73c3b89af7564eb7b38b2150ef
09 - 0xdef24edca08d399e709c8554ab5c327
10 - 0xb7c192bf747908482237bffba916ef14
11 - 0x5a30de3e90380da77731c5f23d8406d5
12 - 0x909efdbce4e7f5f4c6d04a0f6fc6a51b
13 - 0xce3671965e0e7c31293fb9c314bbbf16
14 - 0x6a74f5fb8e93000f48434a002785ef1b
15 - 0x19e9de5a47e7a26b6ed81ba87a63a4be

Summary

This python script will provide me with a roadmap to my Verilog implementation and facilitate debugging. In the next post, I will cover the Verilog implementation and testing. In later posts, I'll continue developing the full AES256 algorithm.