# YAFM

Yet Another Factorization Method.

6 hours

# Behavior

This is a RSA-2048 factorization challenge, with specially crafted primes:

def generate_prime(prime_len):
bits_len = 180
while True:
bits = random.getrandbits(bits_len)
idxs = random.sample(list(range(1, prime_len-2)), bits_len)
p = 1 | 2**(prime_len - 1) | 2**(prime_len - 2)
for i in range(bits_len):
p += (bits >> i & 1)*2**idxs[i]
if isPrime(p):
return p


# Solution

## TL;DR

1. Model the probability of a factor guess with binomial and hypergeometric distribution
2. Run best-first search to get lower bits
3. Factor the public key using Coppersmith's method

## Special primes

This routine creates "sparse" primes looks like:

1100100100000100000000000000100000001000000000000000000001000000000000
0000001000000000000000000000010010010010010001000000100001000110000000
1000000000000100000000000000000000001000000000000100000000000000000000
0000000000100000000001000000000000000010000000000000000000100000000000
0000001000000000100000000100000000000000101001010000001000000000000000
0001000100010010000001000000000000000000000000010100000000010000000000
0000000000000000000000000100101000000000000000000100000100000100000000
0000000001010010000000000000000000000010000000000000000101000000000001
0000000000001100000000000001000000000100000000000001000000010000001000
0000000000000110000000000000100000010010000001000000000000100000000010
0000000000011000000000010000000001000000000000000000000000000000000000
0000010000000000000000000001000000000100000010000000001000010000000000
0000000100001000000000000000000000000000000000000100000000000000000000
0100000000000000000000000000000000010001100000000010100001001000000100
00010100000000000000000001000000000000000001


On the average, we will get primes with only 90 '1' bits.

However, multiplication of two such primes (i.e. n) doesn't have such special structure, although the LSBs usually tends to have more zeros:

1011110001110011110000001101110100011011110000011100111010100001011001
1101111010010101001010010110100101110011110000001111111010010010011101
1101011011101111010101000110110111010000001001110001101100011000010001
1011011010111101110101110011110110000101000110101101001110111110000111
1100000111010101111011111100011101000001010001000010100100011011010001
1110111010000010101010101110110000001100001111001111111100011100101010
0110011111010110101111000100011001001011101001001100001011010111111000
0100101100001101110111100110110101010001111101010000011001110000011010
1110001000001000111111111100111110010100111010011111101011101110000011
0100001110001100110101101011010010001111010101100000011010010011000001
1001011001001111101011001110101000011000001010101100111000010011101110
1110101111000000010011001111100010110000010111010110111010001000010011
1010101111101101010000011001011011111101001010011001100110010101000110
1110001010000111111111010111110011100110101011010000001000111000000110
0100101010100101110101001001110000011101101000000110100110110001000010
1011011110101101101011001101111011001100100011000101000101110001000010
0100010000011101110100010010100101010000000010100010111111010000001000
1000111111100101101111000110110110010000111011011010010011000000011110
1000110001100101000100101110101110011101100011101010101110111000110011
0110010101001101111001011101001100010011011111111010011101011100111010
0100000100000101100110110000101000101001100111111011110000011101101101
0101011010010100000001010001011101011101000001010101000101000011010100
0101010111101010100000110111011001110001101100100000010111101010110110
0110110110100111010010011000010101011100100101000011110010101011111000
0011100011000101000011100110100000100100000001001111111111100000000100
0100001111000111111000010111011011000110010010101110000011111001111111
0110101001011101010001100000100001110101100010001110100011111111001111
0100001001110001110101110001101001110011010111101001101001001001001111
0101011010110111001001110000000100110000100001000100010000010000010001
000000000000000001


The first things I've tried is genetic algorithm. But this approach is able to factor primes only up to 128 bits, failed :(

## Lower bits

The low entropy in lower bits of N gives me another direction: By guessing the lower bits of p, we have:

n = p * q
n = p * q (mod 2^32)
q = n * p^-1 (mod 2^32)


And whether $q$ is "sparse" or not can tell us the correctness of the guess. Furthermore, two of following equations will hold:

n =  p'         *  q'         (mod 2^33)
n = (p' + 2^32) *  q'         (mod 2^33)
n =  p'         * (q' + 2^32) (mod 2^33)
n = (p' + 2^32) * (q' + 2^32) (mod 2^33)


With this approach, we could factor the public key bit-by-bit starting from LSB. The multiplication could be replaced by addition and shifting to have better performance.

The last piece is to measure how good the guess is. From the code, we can see that prime generation has two stages, random.getrandbits can be modeled by binomial distribution, and random.sample can be modeled by hypergeometric distribution.

tol = 1e-6
rv = binom(bits_len, 0.5)
binom_left, binom_right = int(rv.ppf(tol)), int(rv.ppf(1 - tol))
binom_values = {i: rv.pmf(i) for i in range(binom_left, binom_right+1)}

@functools.lru_cache(maxsize=100000)
def get_proba(nbits, length):
return sum(b * hypergeom.pmf(nbits, prime_len-3, i, length) for i, b in binom_values.items())


And I scale up the score with its length to prioritize longer guess:

def fitness(pnbits, qnbits, length):
qprob = get_proba(qnbits, length)
pprob = get_proba(pnbits, length)
prob = qprob * pprob * length
return prob


Armed with these tools, we are able to search the factors with best-first search. Searching for a whole factor takes too long, so we use coppersmith's method after half of the bits is recoverd.

The script for best-first search is at here, and coppersmith's method is at here.