ASCON S-box Pattern Matching for RISC-V

Hi,
We’re trying to implement pattern matching support of S-box part of ASCON encryption algorithms. ASCON’s Permutation Layer consists of a series of According to the specification of ASCON:

The round transformation consists of the following three steps which operate on a 320-bit state divided into 5 words x0, x1, x2, x3, x4 of 64 bits each:

  • Addition of Round Constants: xors a round specific 1-byte constant to word x2.
  • Nonlinear Substitution Layer: applies a 5-bit S-box 64 times in parallel in a bit-sliced fashion (vertically, across words).
  • Linear Diffusion Layer: xors different rotated copies of each word (horizontally, within each word).

S-box layer:
S-box Layer
The C code of Round is as follows which can be found here in the Ascon’s library:

static inline uint64_t ROR(uint64_t x, int n) {
  return x >> n | x << (-n & 63);
}

static inline void ROUND(ascon_state_t* s, uint8_t C) {
  ascon_state_t t;
  /* addition of round constant */
  s->x[2] ^= C;
  /* printstate(" round constant", s); */

  /* substitution layer */
  s->x[0] ^= s->x[4];
  s->x[4] ^= s->x[3];
  s->x[2] ^= s->x[1];
  /* start of keccak s-box */
  t.x[0] = s->x[0] ^ (~s->x[1] & s->x[2]);
  t.x[1] = s->x[1] ^ (~s->x[2] & s->x[3]);
  t.x[2] = s->x[2] ^ (~s->x[3] & s->x[4]);
  t.x[3] = s->x[3] ^ (~s->x[4] & s->x[0]);
  t.x[4] = s->x[4] ^ (~s->x[0] & s->x[1]);
  /* end of keccak s-box */
  t.x[1] ^= t.x[0];
  t.x[0] ^= t.x[4];
  t.x[3] ^= t.x[2];
  t.x[2] = ~t.x[2];

  /* printstate(" substitution layer", &t); */
  /* linear diffusion layer */
  s->x[0] = t.x[0] ^ ROR(t.x[0], 19) ^ ROR(t.x[0], 28);
  s->x[1] = t.x[1] ^ ROR(t.x[1], 61) ^ ROR(t.x[1], 39);
  s->x[2] = t.x[2] ^ ROR(t.x[2], 1) ^ ROR(t.x[2], 6);
  s->x[3] = t.x[3] ^ ROR(t.x[3], 10) ^ ROR(t.x[3], 17);
  s->x[4] = t.x[4] ^ ROR(t.x[4], 7) ^ ROR(t.x[4], 41);
  //printstate(" round output", s);
}

On the other hand, we have a modified Ibex RISC-V core. The substitution layer above and ROR is implemented in hardware. The interface of using this accelerator is by giving an ALU_rr type SBOX ptr_to_array, xx instruction to the core. The pointer to the array should contain 5 input integers of the algorithm. There shouldn’t be load instructions to the array as the hardware is accessing the memory directly.
I’ve created IR from the C code and included the first 2 consecutive rounds. The results get stored back after end of 12 rounds:

define internal fastcc void @P12(ptr nocapture noundef %s) unnamed_addr #2 {
entry:
  %arrayidx.i = getelementptr inbounds [5 x i64], ptr %s, i32 0, i32 2
  %0 = load i64, ptr %arrayidx.i, align 8, !tbaa !7
  %xor.i = xor i64 %0, 240 ;constant not part of S-box
  %arrayidx2.i = getelementptr inbounds [5 x i64], ptr %s, i32 0, i32 4
  %1 = load i64, ptr %arrayidx2.i, align 8, !tbaa !7
  %2 = load i64, ptr %s, align 8, !tbaa !7
  %xor5.i = xor i64 %2, %1
  %arrayidx7.i = getelementptr inbounds [5 x i64], ptr %s, i32 0, i32 3
  %3 = load i64, ptr %arrayidx7.i, align 8, !tbaa !7
  %xor10.i = xor i64 %3, %1
  %arrayidx12.i = getelementptr inbounds [5 x i64], ptr %s, i32 0, i32 1
  %4 = load i64, ptr %arrayidx12.i, align 8, !tbaa !7
  %xor15.i = xor i64 %4, %xor.i
  %not.i = xor i64 %4, -1
  %and.i = and i64 %xor.i, %not.i
  %xor22.i = xor i64 %and.i, %xor5.i
  %not29.i = xor i64 %xor15.i, -1
  %and32.i = and i64 %3, %not29.i
  %xor33.i = xor i64 %and32.i, %4
  %not40.i = xor i64 %3, -1
  %and43.i = and i64 %1, %not40.i
  %xor44.i = xor i64 %xor15.i, %and43.i
  %not51.i = xor i64 %xor10.i, -1
  %and54.i = and i64 %xor5.i, %not51.i
  %xor55.i = xor i64 %and54.i, %3
  %not62.i = xor i64 %xor5.i, -1
  %and65.i = and i64 %4, %not62.i
  %xor66.i = xor i64 %and65.i, %xor10.i
  %xor73.i = xor i64 %xor33.i, %xor22.i
  %xor78.i = xor i64 %xor22.i, %xor66.i
  %xor83.i = xor i64 %xor55.i, %xor44.i
  %not86.i = xor i64 %xor44.i, -1
       ;end of s-box
  %or.i.i = tail call i64 @llvm.fshl.i64(i64 %xor78.i, i64 %xor78.i, i64 45)
  %or.i193.i = tail call i64 @llvm.fshl.i64(i64 %xor78.i, i64 %xor78.i, i64 36)
  %5 = xor i64 %or.i.i, %or.i193.i
  %xor97.i = xor i64 %5, %xor78.i
  %or.i196.i = tail call i64 @llvm.fshl.i64(i64 %xor73.i, i64 %xor73.i, i64 3)
  %or.i199.i = tail call i64 @llvm.fshl.i64(i64 %xor73.i, i64 %xor73.i, i64 25)
  %6 = xor i64 %or.i196.i, %or.i199.i
  %xor109.i = xor i64 %6, %xor73.i
  %or.i202.i = tail call i64 @llvm.fshl.i64(i64 %not86.i, i64 %not86.i, i64 63)
  %or.i205.i = tail call i64 @llvm.fshl.i64(i64 %not86.i, i64 %not86.i, i64 58)
  %7 = xor i64 %or.i202.i, %or.i205.i
  %xor121.i = xor i64 %7, %not86.i
  %or.i208.i = tail call i64 @llvm.fshl.i64(i64 %xor83.i, i64 %xor83.i, i64 54)
  %or.i211.i = tail call i64 @llvm.fshl.i64(i64 %xor83.i, i64 %xor83.i, i64 47)
  %8 = xor i64 %or.i208.i, %or.i211.i
  %xor133.i = xor i64 %8, %xor83.i
  %or.i214.i = tail call i64 @llvm.fshl.i64(i64 %xor66.i, i64 %xor66.i, i64 57)
  %or.i217.i = tail call i64 @llvm.fshl.i64(i64 %xor66.i, i64 %xor66.i, i64 23)
  %9 = xor i64 %or.i214.i, %or.i217.i
  %xor145.i = xor i64 %9, %xor66.i
  %xor.i13 = xor i64 %xor121.i, 225
  %xor5.i15 = xor i64 %xor97.i, %xor145.i
  %xor10.i17 = xor i64 %xor133.i, %xor145.i
  %xor15.i19 = xor i64 %xor109.i, %xor.i13
  %not.i20 = xor i64 %xor109.i, -1
  %and.i21 = and i64 %xor.i13, %not.i20
  %xor22.i22 = xor i64 %and.i21, %xor5.i15
  %not29.i23 = xor i64 %xor15.i19, -1
  %and32.i24 = and i64 %xor133.i, %not29.i23
  %xor33.i25 = xor i64 %and32.i24, %xor109.i
  %not40.i26 = xor i64 %xor133.i, -1
  %and43.i27 = and i64 %xor145.i, %not40.i26
  %xor44.i28 = xor i64 %xor15.i19, %and43.i27
  %not51.i29 = xor i64 %xor10.i17, -1
  %and54.i30 = and i64 %xor5.i15, %not51.i29
  %xor55.i31 = xor i64 %and54.i30, %xor133.i
  %not62.i32 = xor i64 %xor5.i15, -1
  %and65.i33 = and i64 %xor109.i, %not62.i32
  %xor66.i34 = xor i64 %and65.i33, %xor10.i17
  %xor73.i35 = xor i64 %xor33.i25, %xor22.i22
  %xor78.i36 = xor i64 %xor22.i22, %xor66.i34
  %xor83.i37 = xor i64 %xor55.i31, %xor44.i28
  %not86.i38 = xor i64 %xor44.i28, -1
  %or.i.i39 = tail call i64 @llvm.fshl.i64(i64 %xor78.i36, i64 %xor78.i36, i64 45)
  %or.i193.i40 = tail call i64 @llvm.fshl.i64(i64 %xor78.i36, i64 %xor78.i36, i64 36)
  %10 = xor i64 %or.i.i39, %or.i193.i40
  %xor97.i41 = xor i64 %10, %xor78.i36
  %or.i196.i42 = tail call i64 @llvm.fshl.i64(i64 %xor73.i35, i64 %xor73.i35, i64 3)
  %or.i199.i43 = tail call i64 @llvm.fshl.i64(i64 %xor73.i35, i64 %xor73.i35, i64 25)
  %11 = xor i64 %or.i196.i42, %or.i199.i43
  %xor109.i44 = xor i64 %11, %xor73.i35
  %or.i202.i45 = tail call i64 @llvm.fshl.i64(i64 %not86.i38, i64 %not86.i38, i64 63)
  %or.i205.i46 = tail call i64 @llvm.fshl.i64(i64 %not86.i38, i64 %not86.i38, i64 58)
  %12 = xor i64 %or.i202.i45, %or.i205.i46
  %xor121.i47 = xor i64 %12, %not86.i38
  %or.i208.i48 = tail call i64 @llvm.fshl.i64(i64 %xor83.i37, i64 %xor83.i37, i64 54)
  %or.i211.i49 = tail call i64 @llvm.fshl.i64(i64 %xor83.i37, i64 %xor83.i37, i64 47)
  %13 = xor i64 %or.i208.i48, %or.i211.i49
  %xor133.i50 = xor i64 %13, %xor83.i37
  %or.i214.i51 = tail call i64 @llvm.fshl.i64(i64 %xor66.i34, i64 %xor66.i34, i64 57)
  %or.i217.i52 = tail call i64 @llvm.fshl.i64(i64 %xor66.i34, i64 %xor66.i34, i64 23)
  %14 = xor i64 %or.i214.i51, %or.i217.i52
  %xor145.i53 = xor i64 %14, %xor66.i34
  ...
  1. The problem of this IR is that if I pattern match and replace the XOR and AND sequences the preceding fshl calls will not be able to get the result of the S-box hardware. In order to get the result, when I match and replace the S-box DAG, I should inject loads to pass the value from array to the fshl calls.
  2. Another problem is matching the S-box, since the inner rounds don’t have loads or stores, it makes sense to match XOR and AND DAG however there shouldn’t be any load or stores to where the hardware is operating as instead of loading to registers, the hardware accesses the memory directly.
  3. Last problem is matching the large DAG. I implemented adjacency list and plan to compare the expected adj. list to the input. I am wondering is there a builtin LLVM way in SelectionDAG allowing to characterize a large DAG.

My question is how should we design the pattern matching and replacement of this sequence. We already dealt with files such as RISCVISelDAGToDAG.cpp and RISCVInstrInfo.td but I’m questioning is there a more ideal way to approach the problem like in the IR level. One approach I can think of is to create and match an sbox intrinsic and to deal with the store and load conditions in the backend. Would this be a feasible way as the intrinsic would deal with 5 input and outputs?

Trying to pattern-match something like this is likely going to be quite fragile. Is there a reason you can’t just have a C intrinsic that maps to an LLVM intrinsic?