Time
20 hours
Behavior
We got a C++ program that somehow encrypts the flag. There are two files: "hw_x64.metal" and "metal_genx.isa".
Solution
TL;DR
- Disassemble the cisa binary file to vISA asm.
- Clean up the syntax.
- Manually decompile it to pseudo code.
- Construct the eigenvector (i.e. key) from eigenvalues.
Host Executable
The main executable "hw_x64.metal" is unstripped, so it isn't too hard to understand.
Searching for the libraries/strings on the Internet, we can figure out that it uses Intel's C for Metal (i.e. GPGPU). And here's a sample code that looks similar to our main executable.
The executable loads the flag and key, encrypts them with the GPU kernel (metal_genx.isa), and saves the output from the GPU to a file.
All the encryption logic is in the GPU kernel.
Disassemble
To understand the GPU kernel, we need to find the definition of its ISA. We found this file in intel-graphics-compiler on GitHub. There is documentation of the ISA and a disassembler (!!!) in the same repo.
The repo is hard to compile. Luckily, there is a pre-compiled version for Ubuntu 20.04. Spin up a VM and run the following commands to get the vISA asm file.
# ubuntu 20.04
sudo apt-get install libigc-tools
GenX_IR metal_genx.isa -dumpcommonisa
Now, time for reversing :)
Virtual ISA of Metal
The assembly code looks like:
...
.decl V32 v_type=G type=uw num_elts=1 align=word
.decl V33 v_type=G type=d num_elts=1 align=dword
.decl V34 v_type=G type=df num_elts=64 align=GRF
.decl V35 v_type=G type=d num_elts=1 align=dword
...
.decl V297 v_type=G type=ud num_elts=1 align=dword alias=<V37, 0>
.decl V298 v_type=G type=w num_elts=2 align=dword alias=<V35, 0>
.decl V299 v_type=G type=w num_elts=2 align=dword alias=<V41, 0>
.decl V300 v_type=G type=uw num_elts=1 align=word alias=<V42, 0>
...
.function "encrypt_BB_0_1_0"
encrypt_BB_0_1_0:
mov (M1, 1) V32(0,0)<1> V2(0,0)<0;1,0> /// $1
mov (M1, 1) V33(0,0)<1> V32(0,0)<0;1,0> /// $2
mov (M1, 8) V34(0,0)<1> 0x0:df /// $3
mov (M1, 1) V146(0,0)<1> V33(0,0)<0;1,0> /// $4
call (M1, 1) _Z13getDataMatrixIdEv15cm_surfaceindexiu2CMmr8x8_T__BB_1_2_1 /// $5
mov (M1, 1) V35(0,0)<1> 0x0:d /// $6
...
BB_5_6:
mov (M1, 4) V50(0,0)<1> V6(0,0)<1;1,0> /// $38
add (M1, 1) V303(0,0)<1> V45(0,0)<0;1,0> (-)V50(0,0)<0;1,0> /// $39
cmp.lt (M1, 1) P4 V51(0,0)<0;1,0> 0x0:b /// $40
(P4) jmp (M1, 1) BB_6_7 /// $41
mov (M1, 1) V52(0,0)<1> 0x8:w /// $42
...
The syntax contains too much information and is hard to read. Take the first instruction as an example:
mov (M1, 1) V32(0,0)<1> V2(0,0)<0;1,0>
- mov: mnemonic
- (M1, 1): exec_mask, exec_size
- V32(0,0)<1>: dst(row_offset, col_offset)<stride>
- V2(0,0)<0;1,0>: src0(row_offset, col_offset)<vertical_stride; width, horizontal_stride>
The semantics of the mov instruction are:
for (i = 0; i < exec_size; ++i) {
    if (ChEn[i]) { // i.e. exec_mask
        dst[i] = src0[i]; // with type conversion
    }
}
Many of the instructions in Metal are SIMD operations: they operate on multiple data with one instruction.
The exec_mask in this task is always M1, which means no masking, so we can ignore the branch on ChEn.
All variables in Metal are arrays, and there are various ways to access the values in these arrays.
The offsets in the parentheses specify how many elements to skip from the start of the array. Since one register row is 32 bytes, the element offset is defined as:
32 / sizeof(element) * row_offset + col_offset
For example, for a dword (4-byte) variable, (1,2) addresses element 32/4 * 1 + 2 = 10.
For the destination variable, stride indicates how many elements to skip to move to the next element.
Using Python's syntax, it is:
dst[offset::stride][:exec_size]
The source variable is more complicated.
After applying the offset, we reshape the remaining array into a matrix: width is the width of the matrix, horizontal_stride indicates how many elements to skip to move to the next element within a row, and vertical_stride indicates how many elements to skip to move to the next row.
Then the matrix is flattened back into a 1D array. For example, (M1, 8) V0<0;2,0> means [[V0[0]] * 2] * 4 flattened, i.e. V0[0] broadcast 8 times.
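To double-check our reading of the region rule, here is a small sketch (the helper name and example values are ours, not from the ISA spec):

```python
# Sketch of the <vertical_stride; width, horizontal_stride> region rule
# described above (helper name is ours, not from the ISA spec).
def read_region(arr, offset, vstride, width, hstride, exec_size):
    out = []
    for r in range(exec_size // width):        # one iteration per matrix row
        base = offset + r * vstride
        out.extend(arr[base + c * hstride] for c in range(width))
    return out

V0 = list(range(16))
# (M1, 8) V0<0;2,0>: every row of width 2 starts at the same place and
# repeats the same element, so V0[0] is broadcast 8 times.
print(read_region(V0, 0, 0, 2, 0, 8))   # [0, 0, 0, 0, 0, 0, 0, 0]
```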
Fortunately, there are only 7 kinds of indices in this task:
- Scalar: 0;1,0
- Pointer array: 1,0
- Row vector: 1;1,0
- One column of pairs: 2;1,0
- arr[:]: 1
- arr[::8]: 8
- arr[::16]: 16
Most of the variables in this task are scalars; we converted the syntax to a numpy-like form to make it more readable.
.function "encrypt_BB_0_1_0"
encrypt_BB_0_1_0:
mov<1> V32:uw thread_y:G
mov<1> V33:d V32:uw
mov<8> V34[0:]:df 0x0:df
mov<1> V146:d V33:d
call<1> _Z13getDataMatrixIdEv15cm_surfaceindexiu2CMmr8x8_T__BB_1_2_1
mov<1> V35:d 0x0:d
lifetime.start V58
...
mov<8> V287[0:]:df V58[0:]:df
mov<8> V287[8:]:df V58[8:]:df
mov<8> V287[16:]:df V58[16:]:df
mov<8> V287[24:]:df V58[24:]:df
mov<8> V287[32:]:df V58[32:]:df
In the second part of the variable definitions, there are some variable aliases:
.decl V37 v_type=G type=d num_elts=1 align=dword
.decl V35 v_type=G type=d num_elts=1 align=dword
.decl V297 v_type=G type=ud num_elts=1 align=dword alias=<V37, 0>
.decl V298 v_type=G type=w num_elts=2 align=dword alias=<V35, 0>
They are used for type conversion, and we replace all those aliases with a Go-like cast syntax:
shr<1> V38:ud V37.(ud[1]):ud 0x4:ud
Branches are implemented by conditional instructions:
cmp.eq (M1, 1) P8 V61(0,0)<0;1,0> 0x8:d /// $83
(!P8) jmp (M1, 1) BB_10_11 /// $84
...
cmp.gt (M1, 1) P39 V174(0,0)<0;1,0> V168(0,0)<0;1,0> /// $434
(P39) sel (M1, 1) V69(0,0)<1> r[A43(0),0]<0;1,0>:d V69(0,0)<0;1,0> /// $435
(P39) sel (M1, 1) V70(0,0)<1> V169(0,0)<0;1,0> V70(0,0)<0;1,0> /// $436
There are also pointer/address types in this language:
shl<1> V42:w V41.(w[2])[0]:w 0x3:w
addr_add<1> A0[0:1]:A &V39+0 V42.(uw[1]):uw
mov<1> V43:df A0[0][0]:uq
This snippet is equivalent to V43 = V39[V41]. Note that addr_add works on byte offsets, so there is a shl 3 to multiply the index by the element size (V39 is an unsigned qword array).
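Our paraphrase of that snippet in Python (the array contents here are hypothetical):

```python
# Paraphrase of the addr_add snippet above (values are hypothetical):
# V39 is an unsigned qword array, so the index V41 is shifted left by 3
# (multiplied by sizeof(uint64) == 8) to form a byte offset.
V39 = [10, 20, 30, 40]        # unsigned qword array
V41 = 2
V42 = V41 << 3                # shl 3: byte offset = index * 8
A0 = V42                      # addr_add: A0 = &V39 + V42
V43 = float(V39[A0 // 8])     # mov: V43 = (double)*(uint64 *)A0
print(V43)                    # 30.0
```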
The full script can be found here.
Decompile
With the simplified assembly, we manually translate it into python-like pseudo code.
// v58 = [None] * 64
lifetime.start V58
BB_2_3:
// for v35 in range(0, 8):
// v38 = thread_y * 32 + v35 * 4
// v38p = thread_y * 64 + v35 * 8
shl<1> V36:d V35:d 0x6:d
mov<1> V37:d 0x200:d
mad<1> V37:d V32:uw V37:d V36:d
shr<1> V38:ud V37.(ud[1]):ud 0x4:ud
// v39 = inp_t6[v38p:v38p+8]
oword_ld (4) T6 V38(0,0)<0;1,0> V39.0
// v40 = v35 * 8 * 8
shl<1> V40:w V35.(w[2])[0]:w 0x6:w
mov<1> V41:d 0x0:d
BB_3_4:
// for v41 in range(0, 8):
// v43 = v39[v41]
shl<1> V42:w V41.(w[2])[0]:w 0x3:w
addr_add<1> A0[0:1]:A &V39+0 V42.(uw[1]):uw
mov<1> V43:df A0[0][0]:uq
This process might be simplified with pattern matching or a symbolic engine, but we did it manually, which took about 12 hours...
In this ISA, there is no predefined register set and no calling convention. Arguments and return values are implemented using global variables:
// # V285 is the argument, and V286 is the return value.
// v67[v70] = func_v287_tri_argmax(v70)
mov<1> V285:d V70:d
call<1> func_v287_tri_argmax
shl<1> V126:w V70.(w[2])[0]:w 0x2:w
addr_add<1> A30[0:1]:A &V67+0 V126.(uw[1]):uw
mov<1> A30[0][0:]:d V286:d
These variables can be found around the caller and among the free variables of the callee.
Here are the annotated asm and its translated pseudo code.
Algorithm
The last function finds the position of the maximum-magnitude element in row v285 of the upper triangular part of the key matrix v287.
# _Z16aAqwvgDTmHcpllEMIdEiu2CMmr8x8_T_i_BB_14_15_14:
def func_v287_tri_argmax(v285):
# global v287
v286 = v285 + 1
if v285 < 6:
for v289 in range(v285 + 2, 8):
v292 = abs(v287[v285 * 8 + v289])
v295 = abs(v287[v285 * 8 + v286])
if v292 > v295:
v286 = v289
return v286
And the result is cached in v67.
# After modifying v287
v287[i66] = v287[i66] * v190 + v242
# we should update v67
a67v = v67[v239]
if a67v != v189:
v248 = abs(v287[i66])
v251 = abs(v287[v239 * 8 + a67v])
if v248 > v251:
continue
v286 = v189
else:
v286 = func_v287_tri_argmax(v239)
v67[v239] = v286
And this function finds the maximum-magnitude element in the whole upper triangle using the v67 cache.
# _Z16JPYgRtzJnMjnpuDbIdEvu2CMmr8x8_T_RiS2__BB_16_17_16:
def func2():
# global v67, v287
v69 = v67[0]
v167 = abs(v287[v67[0]])
v168 = v167
v70 = 0
for v169 in range(1, 7):
v174 = abs(v287[v169 * 8 + v67[v169]])
if v174 > v168:
v69 = v67[v169]
v70 = v169
v168 = max(v174, v168)
return v69, v70
The cache makes the code look more complicated than it should be. Actually, we can remove those parts and compute the full argmax without the cache.
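Under that assumption, the cache-free version is just a numpy argmax over the strict upper triangle (our own rewrite, not the challenge code; note the pseudo code returns (col, row)):

```python
import numpy as np

# Cache-free equivalent of the v67-based search (our own rewrite, assuming
# the cached code computes the argmax of |m[r, c]| over the strict upper
# triangle). Returns (col, row), matching the pseudo code's (v69, v70).
def func_upper_tri_abs_argmax(mat):
    m = np.abs(np.triu(mat, k=1))                  # keep strict upper triangle
    r, c = np.unravel_index(np.argmax(m), m.shape)
    return int(c), int(r)

mat = np.array([[0.0, 5.0,  1.0],
                [0.0, 0.0, -7.0],
                [0.0, 0.0,  0.0]])
print(func_upper_tri_abs_argmax(mat))   # (2, 1): |-7| at row 1, col 2
```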
There are also some functions inlined in the main function; we moved them out to simplify it.
Finally, we converted it to runnable Python code, which can be found here.
The main function looks like:
def encrypt(inp_key_t6, inp_data_t8):
data_v34 = func_getDataMatrix(inp_data_t8, thread_y)
assert inp_key_t6.dtype == np.uint64
mat_org_v58 = inp_key_t6[thread_y * 64:][:64].astype(np.double).reshape(8, 8)
mat_org_v58 *= 2.0**(-64)
# set m[i,j] = m[j,i] or m[j,i] = m[i,j] according to tsc
mat_org_v58 = maybe_to_random_symmetric(mat_org_v58)
out_v66 = np.eye(8, dtype=np.double) # diagonal_mat8x8
mat = mat_org_v58.copy()
for _ in range(0x800):
mc, mr = func_upper_tri_abs_argmax(mat)
if too_small_cf_diag(mat, mr, mc):
mat[mr,mc] = 0
c3, c1, c2 = func_get_wtf_coeffs(mat, mr, mc)
mat = func_mix_matrix(mat, mr, mc, c1, c2, c3)
out_v66 = func_mix_row(out_v66, mr, mc, c1, c2)
# get diagonal array, first 8 elements
diag_elems_v87 = np.zeros(64)
diag_elems_v87[:8] = np.diag(mat)[:8]
# latter 8 * 7 = 56 elements
for r in range(8):
mat = remove_col_and_row(mat_org_v58, r)
for _ in range(0x800):
mc, mr = func_upper_tri_abs_argmax(mat)
# whether adding m[mr,mc] to m[mr,mr] and m[mc,mc] won't change its value
# (i.e. m[r,c] + m[r,r] == m[r,r] and m[r,c] + m[c,c] == m[c,c])
if too_small_cf_diag(mat, mr, mc):
mat[mr,mc] = 0
c3, c1, c2 = func_get_wtf_coeffs(mat, mr, mc)
mat = func_mix_matrix(mat, mr, mc, c1, c2, c3)
diag_elems_v87[r * 7 + 8:][:7] = np.diag(mat)[:7]
out_v66 = data_v34 * np.abs(out_v66)
return diag_elems_v87.tobytes() + out_v66.tobytes()
The function does the following:
- Reshape the key into an 8x8 matrix.
- Shrink its value range to [0, 1].
- Generate two matrices, mat and out_v66, from the key; out_v66 is the key for encrypting the flag.
- We also get the diagonals of mat and of its 8 sub-matrices.
I first tried to understand what those mix functions were doing, but failed.
Then I increased 0x800 to 0x1000 and found that the result had already converged and didn't change. It looked like some numerical method, but that alone didn't help.
Finally, I printed out mat and out_v66: mat is almost a diagonal matrix, and out_v66 looks random.
The previous CTF I played was p4ctf, which had some challenges about diagonalization. That made me try the equation:
out_v66.T @ np.diag(np.diag(mat)) @ out_v66 ~= mat_org_v58
Bingo! The loop is calculating an SVD (for this symmetric matrix, equivalently an eigendecomposition).
It means we have the eigenvalues, and we need to recover the eigenvectors.
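The iteration pattern in the main loop (pick the largest off-diagonal element, mix the corresponding rows/columns, repeat until convergence) matches the classical Jacobi eigenvalue method. A minimal sketch of that method, as our own reimplementation under this assumption rather than the challenge code:

```python
import numpy as np

# Classical Jacobi eigenvalue iteration (our own sketch, not the challenge
# code): repeatedly zero the largest off-diagonal element with a rotation
# until the matrix is (nearly) diagonal.
def jacobi(A, sweeps=100):
    A = A.copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        # index of the largest |element| in the strict upper triangle
        iu = np.triu_indices(n, 1)
        k = np.argmax(np.abs(A[iu]))
        p, q = iu[0][k], iu[1][k]
        if abs(A[p, q]) < 1e-12:           # converged
            break
        # rotation angle that zeroes A[p, q]
        theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
        c, s = np.cos(theta), np.sin(theta)
        R = np.eye(n)
        R[p, p] = c; R[q, q] = c; R[p, q] = s; R[q, p] = -s
        A = R.T @ A @ R
        V = V @ R
    return np.diag(A), V   # eigenvalues, eigenvectors (columns of V)

A = np.array([[2.0, 1.0], [1.0, 2.0]])
w, V = jacobi(A)           # eigenvalues of A are 1 and 3
```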
There was a paper that became quite famous last year, Eigenvectors from Eigenvalues, and the only equation we need is in its abstract.
For numerical stability, I computed it in the log domain. Also, don't forget to use round instead of floor.
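The identity from the paper's abstract relates |v[i,j]|² to the eigenvalues of the matrix and of its principal minors M_j (the matrix with row and column j removed), which is exactly what diag_elems_v87 contains (remove_col_and_row gives the minors). A sketch of it (our own helper, computed directly rather than in the log domain):

```python
import numpy as np

# Eigenvectors-from-eigenvalues identity (our own helper; the writeup
# evaluates it in the log domain for stability):
#   |v[i,j]|^2 * prod_{k != i} (l_i(A) - l_k(A))
#       = prod_k (l_i(A) - l_k(M_j))
# where M_j is A with row j and column j removed.
def abs_eigvecs(eig_A, eig_minors):
    n = len(eig_A)
    v_abs = np.zeros((n, n))
    for i in range(n):
        den = np.prod(eig_A[i] - np.delete(eig_A, i))
        for j in range(n):
            num = np.prod(eig_A[i] - eig_minors[j])
            v_abs[i, j] = np.sqrt(max(num / den, 0.0))
    return v_abs            # row i: |components| of the i-th eigenvector

# self-check on a random symmetric matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
A = (A + A.T) / 2
eig_A = np.linalg.eigvalsh(A)
eig_minors = [np.linalg.eigvalsh(np.delete(np.delete(A, j, 0), j, 1))
              for j in range(8)]
est = abs_eigvecs(eig_A, eig_minors)
w, V = np.linalg.eigh(A)    # reference eigenvectors, for comparison
```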
Here is the decryption script.