| 1 | +Abstract from phdays.com: |
| 2 | + |
| 3 | +A lot of time has been spent on improving hash cracking speed, but the |
| 4 | +results still leave much to be desired. However, what if it were |
| 5 | +possible to make the computer optimize the code and to separate crypto |
| 6 | +primitives from optimizations? The most flexible and powerful solution |
| 7 | +is code generation. The speaker will give an overview of various |
| 8 | +approaches and demonstrate the code generation techniques used in |
| 9 | +john-devkit to improve John the Ripper, the famous password cracker. |
| 10 | + |
| 11 | +Slides below: |
| 12 | +--- |
| 13 | + |
| 14 | +john-devkit: specialized compiler for hash cracking |
| 15 | + |
| 16 | +Aleksey Cherepanov |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +General |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +john-devkit |
| 25 | +- is an experiment |
| 26 | + - not yet embraced by John the Ripper developer community |
| 27 | +- is a code generator |
| 28 | +- on input: algo written in special language and a list of |
| 29 | + optimizations to apply |
| 30 | +- on output: C file for John the Ripper |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +John the Ripper (JtR) |
| 35 | +- the famous hash cracker |
| 36 | +- primary purpose is to detect weak Unix passwords |
| 37 | +- supports 200+ hash formats (types) |
| 38 | +- supports several kinds of compute devices: |
| 39 | +  - CPU, Xeon Phi |
| 40 | +    - scalar |
| 41 | +    - SIMD: SSE2+/AVX/XOP, AVX2, MIC/AVX-512, AltiVec, NEON |
| 42 | +  - GPU |
| 43 | +    - OpenCL, CUDA |
| 44 | +  - FPGA, Epiphany |
| 45 | +    - currently for bcrypt only |
| 46 | + |
| 47 | +--- |
| 48 | + |
| 49 | +Problems of JtR development |
| 50 | +- programmer effort scales poorly with 200+ formats: sometimes it |
| 51 | + is hard to apply even 1 optimization to all formats: |
| 52 | + - important formats get the optimization first |
| 53 | + - each additional format to optimize eats more time |
| 54 | +- support for each device needs a separate implementation |
| 55 | +- readability degrades when various cases are handled by preprocessor |
| 56 | + |
| 57 | +--- |
| 58 | + |
| 59 | +Aims of john-devkit |
| 60 | +- to separate crypto algorithms, optimizations, and output code for |
| 61 | + various devices |
| 62 | +- to include optimizations specific for hash cracking and John the Ripper |
| 63 | +- to provide better syntax |
| 64 | +- to retain or improve performance |
| 65 | +- to provide precise control over optimizations |
| 66 | +- to support various devices: CPU, GPU, FPGA |
| 67 | +- to give great output for great input (not for any input) |
| 68 | +- to be simple |
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +Early results |
| 73 | +- john-devkit is not mature |
| 74 | +- 7 formats were implemented separating crypto primitives, |
| 75 | + optimizations, and device specific code |
| 76 | +- good speeds (over default implementation in JtR): |
| 77 | + - raw-sha256 +22% |
| 78 | + - raw-sha224 +20% |
| 79 | + - raw-sha512 +6% |
| 80 | + - raw-sha384 +5% |
| 81 | +- bad speeds (but expose interesting features of john-devkit): |
| 82 | + - raw-sha1 -1% |
| 83 | + - raw-md4 -11% |
| 84 | + - raw-md5 -15% |
| 85 | +- optimizations implemented: interleaving, vectorization, loop |
| 86 | + unrolling, early reject, additional batching (loop around algo) |
| 87 | +- all formats got all optimizations without effort |
| 88 | + |
| 89 | +--- |
| 90 | + |
| 91 | +Optimizations |
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +Cracking process |
| 96 | +- we are in attacker's position |
| 97 | +- we have a lot of candidates to try |
| 98 | + - high parallelism |
| 99 | +- high level algo: |
| 100 | + - load hashes (once) |
| 101 | + - generate some candidates |
| 102 | + - compute hashes (or only parts) |
| 103 | + - reject most of wrong candidates |
| 104 | + - check probable passwords precisely (rare case) |
| 105 | + - generate next batch of candidates and repeat |
| 106 | +- formats are integrated into this process using OOP-like calls over |
| 107 | + function pointers |
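The high-level loop above can be sketched in Python. This is only an illustration: the class `RawMD5Format`, its method names, and the 4-byte partial check are assumptions made for the example and do not reproduce John the Ripper's actual format interface, which is a C struct of function pointers.

```python
import hashlib


class RawMD5Format:
    """Hypothetical format object standing in for a JtR format."""

    def load(self, hex_hashes):
        # Done once: keep full hashes and their 4-byte prefixes.
        self.full = {bytes.fromhex(h) for h in hex_hashes}
        self.partial = {h[:4] for h in self.full}

    def crypt_all(self, candidates):
        # Compute hashes for a whole batch of candidates.
        return [hashlib.md5(c.encode()).digest() for c in candidates]

    def cmp_partial(self, digest):
        # Cheap check that rejects almost all wrong candidates.
        return digest[:4] in self.partial

    def cmp_exact(self, digest):
        # Precise check, reached only rarely.
        return digest in self.full


def crack(fmt, hex_hashes, batches):
    fmt.load(hex_hashes)
    found = []
    for batch in batches:  # generate the next batch and repeat
        for cand, digest in zip(batch, fmt.crypt_all(batch)):
            if fmt.cmp_partial(digest) and fmt.cmp_exact(digest):
                found.append(cand)
    return found


target = hashlib.md5(b"secret").hexdigest()
batches = [["123456", "password"], ["secret", "qwerty"]]
print(crack(RawMD5Format(), [target], batches))
```

The main loop only ever calls methods of `fmt`, so a different hash type could be plugged in by supplying another object with the same four methods.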
| 108 | + |
| 109 | +--- |
| 110 | + |
| 111 | +Optimizations |
| 112 | +- some optimizations do not affect internals of crypto algorithms in |
| 113 | +  any way and may be added to any algorithm |
| 114 | +  - additional loop around algo to process more candidates in 1 call |
| 115 | +  - OpenMP support |
| 116 | +- other optimizations affect crypto algorithms |
| 117 | +  - vectorization (SIMD) |
| 118 | +  - precomputation |
| 119 | +    - e.g. first few steps in MD*/SHA* for partially changed input |
| 120 | +  - reversal of operations |
| 121 | +    - e.g. last few steps in MD*/SHA*, DES final permutation |
| 122 | +  - loop unrolling |
| 123 | +  - interleaving |
| 124 | +  - bitslicing |
| 125 | +  - and others |
| 126 | + |
| 127 | +--- |
| 128 | + |
| 129 | +Bitslice |
| 130 | +- splits numbers into bits and computes everything through bitwise |
| 131 | + operations |
| 132 | +- optimization focuses on minimization of Boolean formula (or circuit) |
| 133 | +- Roman Rusakov generated current formulas for S-boxes of DES used in |
| 134 | + John the Ripper with custom generator |
| 135 | + - it took 3 months |
| 136 | +- Billy Bob Brumley demonstrated application of simulated annealing |
| 137 | + algorithm to scheduling of DES S-box instructions |
| 138 | +- so code generation is not new for John the Ripper (not to mention |
| 139 | + the C preprocessor) |
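To make the idea of bitslicing concrete, here is a minimal sketch that adds several 4-bit numbers at once using only XOR/AND/OR. Python integers stand in for machine registers; the helper names `to_slices`, `from_slices`, and `bitslice_add` are made up for this example and are unrelated to the DES S-box formulas mentioned above.

```python
def to_slices(values, width):
    # slices[i] holds bit i of every value; lane j lives in bit j of a slice.
    return [sum(((v >> i) & 1) << lane for lane, v in enumerate(values))
            for i in range(width)]


def from_slices(slices, lanes):
    # Inverse transposition: gather the bits of each lane back into a number.
    return [sum(((s >> lane) & 1) << i for i, s in enumerate(slices))
            for lane in range(lanes)]


def bitslice_add(a, b):
    # Ripple-carry adder expressed as a Boolean circuit; every operation
    # handles one bit position of all lanes simultaneously.
    out, carry = [], 0
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out  # the carry out of the top bit is dropped (modular addition)


xs = [3, 7, 12, 15]
ys = [1, 9, 5, 15]
sums = from_slices(bitslice_add(to_slices(xs, 4), to_slices(ys, 4)), len(xs))
print(sums)
```

The cost of the circuit is the number of bitwise operations in `bitslice_add`, which is why the optimization effort goes into minimizing the Boolean formula.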
| 140 | + |
| 141 | +--- |
| 142 | + |
| 143 | +Other solutions |
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +OpenCL |
| 148 | +- is the first thing I hear when I talk about output for both CPU and GPU |
| 149 | +- has quite heavy syntax (based on C) |
| 150 | +- knows nothing about John the Ripper |
| 151 | +- does not have automatic bitslicing |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +Dynamic formats in John the Ripper |
| 156 | +- were implemented by Jim Fougeron |
| 157 | +- separate crypto primitives from formats |
| 158 | + - so md5($p) and md5(md5($p)) have one code base |
| 159 | + - work at runtime |
| 160 | +- john-devkit aims to do a similar thing but at compile time |
| 161 | + and with the ability to optimize better |
| 162 | + - so md5(md5($p)) would get more optimizations (at the price of separate |
| 163 | + code) |
| 164 | + |
| 165 | +--- |
| 166 | + |
| 167 | +C Macros |
| 168 | +- allow doing things, but are not smart |
| 169 | +- an example of loop unroll in Keccak defining all useful variants: |
| 170 | +>>>> |
| 171 | +[...] |
| 172 | +#elif (Unrolling == 3) |
| 173 | +#define rounds \ |
| 174 | +    prepareTheta \ |
| 175 | +    for(i=0; i<24; i+=3) { \ |
| 176 | +        thetaRhoPiChiIotaPrepareTheta(i  , A, E) \ |
| 177 | +        thetaRhoPiChiIotaPrepareTheta(i+1, E, A) \ |
| 178 | +        thetaRhoPiChiIotaPrepareTheta(i+2, A, E) \ |
| 179 | +        copyStateVariables(A, E) \ |
| 180 | +    } \ |
| 181 | +    copyToState(state, A) |
| 182 | +#elif (Unrolling == 2) |
| 183 | +#define rounds \ |
| 184 | +    prepareTheta \ |
| 185 | +    for(i=0; i<24; i+=2) { \ |
| 186 | +        thetaRhoPiChiIotaPrepareTheta(i  , A, E) \ |
| 187 | +        thetaRhoPiChiIotaPrepareTheta(i+1, E, A) \ |
| 188 | +    } \ |
| 189 | +    copyToState(state, A) |
| 190 | +[...] |
| 191 | +<<<< |
| 192 | + |
| 193 | +--- |
| 194 | + |
| 195 | +X-Macro |
| 196 | +- is a tricky way to use macros, most likely with a separate file to |
| 197 | + be included multiple times: |
| 198 | + - the file has code with variable parts |
| 199 | + - these parts are defined before \#include |
| 200 | +- so \#include provides a "template engine" |
| 201 | +- example from NetBSD's libcrypt: |
| 202 | +>>>> |
| 203 | +[...] |
| 204 | +#define HASH_Init SHA1Init |
| 205 | +#define HASH_Update SHA1Update |
| 206 | +#define HASH_Final SHA1Final |
| 207 | +#include "hmac.c" |
| 208 | +<<<< |
| 209 | + |
| 210 | +--- |
| 211 | + |
| 212 | +john-devkit technical details |
| 213 | + |
| 214 | +--- |
| 215 | + |
| 216 | +From Python to C in john-devkit |
| 217 | +- bytecode is generated from algorithm description |
| 218 | +- bytecode is modified by several functions chosen by user |
| 219 | +- C code is generated from the modified bytecode using a template |
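The three stages can be sketched as follows. Everything here is an assumption made for illustration: the instruction names, the list-of-lists bytecode layout, the `interleave` filter, and the C template are simplified stand-ins, not john-devkit's real bytecode or templates.

```python
# Stage 1: bytecode for "r = rol(a + b, 3)" (hypothetical instruction set).
bytecode = [
    ['input', 'a'],
    ['input', 'b'],
    ['new_const', 'c0', '3'],
    ['add', 't0', 'a', 'b'],
    ['rol', 'r', 't0', 'c0'],
    ['output', 'r'],
]


# Stage 2: a filter chosen by the user. This one interleaves n independent
# computations by emitting every instruction n times with renamed variables.
def interleave(code, n):
    consts = {ins[1] for ins in code if ins[0] == 'new_const'}

    def rename(name, i):
        return name if name in consts else '%s_%d' % (name, i)

    out = []
    for ins in code:
        if ins[0] == 'new_const':
            out.append(ins)  # constants are shared by all copies
        else:
            for i in range(n):
                out.append([ins[0]] + [rename(a, i) for a in ins[1:]])
    return out


# Stage 3: C code is produced from the modified bytecode using a template.
C_LINES = {
    'input': 'uint32_t {0} = in_{0};',
    'new_const': 'const uint32_t {0} = {1};',
    'add': 'uint32_t {0} = {1} + {2};',
    'rol': 'uint32_t {0} = ({1} << {2}) | ({1} >> (32 - {2}));',
    'output': 'out_{0} = {0};',
}
TEMPLATE = 'static void compute(void)\n{{\n{body}\n}}\n'


def to_c(code):
    body = '\n'.join('    ' + C_LINES[ins[0]].format(*ins[1:]) for ins in code)
    return TEMPLATE.format(body=body)


c_source = to_c(interleave(bytecode, 2))
print(c_source)
```

Because the filter works on the flat instruction list and knows nothing about the algorithm, the same filter could be applied to the bytecode of any other hash.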
| 220 | + |
| 221 | +--- |
| 222 | + |
| 223 | +bytecode |
| 224 | +- while algorithms are written in Python with modified environment, |
| 225 | + john-devkit uses flat representation of code using its own |
| 226 | + instruction language called bytecode |
| 227 | +- some instructions of this language express constructions specific to |
| 228 | + hash cracking |
| 229 | + - for instance, state variables of hash functions are defined by |
| 230 | + special instruction |
| 231 | +- bytecode is very simple |
| 232 | +- bytecode is intended to be rich to express common constructions |
| 233 | + natively to simplify optimization |
| 234 | + |
| 235 | +--- |
| 236 | + |
| 237 | +Example of specific instruction |
| 238 | +- separate instruction is used to define state variable, so |
| 239 | + john-devkit uses a filter to replace initial state with values for |
| 240 | + SHA-224 having code for SHA-256: |
| 241 | +>>>> |
| 242 | +def override_state(code, state): |
| 243 | +    consts = {}  # constant name -> its 'new_const' instruction |
| 244 | +    for l in code: |
| 245 | +        if l[0] == 'new_const': |
| 246 | +            consts[l[1]] = l |
| 247 | +        if l[0] == 'new_state_var': |
| 248 | +            consts[l[2]][2] = str(state.pop(0))  # patch initial value in place |
| 249 | +    return code |
| 250 | +<<<< |
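A possible way to call this filter is shown below. The instruction layout (`['new_const', name, value]` and `['new_state_var', var, const_name]`) is inferred from the indices used in the function and may differ from the real bytecode; the initial values are the standard SHA-256 and SHA-224 ones.

```python
def override_state(code, state):
    consts = {}  # constant name -> its 'new_const' instruction
    for l in code:
        if l[0] == 'new_const':
            consts[l[1]] = l
        if l[0] == 'new_state_var':
            consts[l[2]][2] = str(state.pop(0))  # patch initial value in place
    return code


SHA256_IV = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
             0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]
SHA224_IV = [0xc1059ed8, 0x367cd507, 0x3070dd17, 0xf70e5939,
             0xffc00b31, 0x68581511, 0x64f98fa7, 0xbefa4fa4]

# Hand-written stand-in for the start of SHA-256 bytecode (assumed layout).
code = []
for i, value in enumerate(SHA256_IV):
    code.append(['new_const', 'c%d' % i, str(value)])
    code.append(['new_state_var', 's%d' % i, 'c%d' % i])

# Pass a copy of the list: the filter consumes it with pop(0).
code = override_state(code, list(SHA224_IV))
print([hex(int(l[2])) for l in code if l[0] == 'new_const'])
```

The filter relies on each `new_const` appearing before the `new_state_var` that refers to it, which holds for the stand-in bytecode built here.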
| 251 | + |
| 252 | +--- |
| 253 | + |
| 254 | +Optimizations specific to password cracking |
| 255 | +- use knowledge not available to a regular compiler: |
| 256 | +- code can be moved between some functions of a format |
| 257 | +- the functions have different probabilities of being called |
| 258 | +  - so main computation is always called |
| 259 | +  - check of probable candidates is very rare |
| 260 | +    - it almost implies a successful guess (for strong hashes) |
| 261 | +  - also hashes are loaded only once while there are millions of |
| 262 | +    candidates being hashed every second |
| 263 | + |
| 264 | +--- |
| 265 | + |
| 266 | +Specific optimization: early reject |
| 267 | +- hashes are long |
| 268 | +- some output values may be computed a bit quicker than others |
| 269 | +- one 32-bit or 64-bit value is usually enough to reject almost all |
| 270 | + wrong candidates |
| 271 | +- so john-devkit drops instructions for computation of other output |
| 272 | + values in main working function and places full implementation into |
| 273 | + function for precise check of possible password |
| 274 | +- main implementation is vectorized while full implementation is |
| 275 | + scalar because it checks only 1 candidate |
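Dropping the instructions for unused outputs amounts to dead code elimination on the bytecode. A minimal sketch follows, assuming a hypothetical single-assignment bytecode of the form `[op, destination, sources...]`; the function name `keep_one_output` is invented for this example.

```python
def keep_one_output(code, wanted):
    """Drop every instruction that does not contribute to output `wanted`."""
    needed = {wanted}
    kept = []
    # Walk backwards so that the users of a value are seen before its definition.
    for ins in reversed(code):
        op, dst, srcs = ins[0], ins[1], ins[2:]
        if op == 'output':
            if dst == wanted:
                kept.append(ins)
        elif dst in needed:
            kept.append(ins)
            needed.update(srcs)
    kept.reverse()
    return kept


# Toy "hash" with two output words; t0 is available earlier than t2.
full = [
    ['input', 'a'],
    ['input', 'b'],
    ['add', 't0', 'a', 'b'],
    ['xor', 't1', 't0', 'a'],
    ['add', 't2', 't1', 'b'],
    ['output', 't0'],
    ['output', 't2'],
]

fast = keep_one_output(full, 't0')  # for the main (vectorized) function
print(fast)
```

The untouched `full` list would still be used to generate the scalar function that checks a probable password precisely.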
| 276 | + |
| 277 | +--- |
| 278 | + |
| 279 | +Specific optimization: steps reversal |
| 280 | +- some operations can be reversed |
| 281 | + - if r = i + C, we know r, and C is a constant, then i = r - C |
| 282 | + - John the Ripper learns "r" when it loads hashes |
| 283 | +- john-devkit can sometimes reverse operations, replacing "forward" |
| 284 | + computation during cracking (applied per candidate password) with |
| 285 | + reverse computation at startup (applied per hash) |
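The `r = i + C` case can be worked through with 32-bit modular arithmetic. The constant and the function names below are chosen only for illustration.

```python
MASK = 0xffffffff
C = 0x67452301  # illustrative constant added in the last step


def last_step(i):
    # "Forward" computation: would run for every candidate password.
    return (i + C) & MASK


def reverse_last_step(r):
    # Reverse computation: runs once per loaded hash.
    return (r - C) & MASK


# At startup: r is known from the loaded hash, so i can be recovered.
loaded_r = last_step(0x12345678)
target_i = reverse_last_step(loaded_r)

# During cracking: compare the candidate's intermediate value with target_i
# directly, so the final addition is never computed per candidate.
candidate_i = 0x12345678
print(candidate_i == target_i)
```

The subtraction is masked the same way as the addition, so the reversal also holds when the forward step wraps around 2^32.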
| 286 | + |
| 287 | +--- |
| 288 | + |
| 289 | +Full Python |
| 290 | +- is available to define algorithms |
| 291 | +- the environment has some objects with overloaded instructions to |
| 292 | + produce bytecode in a global variable instead of running it right away |
| 293 | +- but any Python code can be used |
| 294 | + - it is evaluated fully before further steps of code generation |
| 295 | + - but to produce good output some additional markup may be needed |
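The recording mechanism can be sketched with operator overloading. The class `Var`, the helper `new_input`, and the instruction names are assumptions for this example; john-devkit's real environment is richer.

```python
bytecode = []  # global list that collects instructions
_counter = 0


class Var:
    """Symbolic value: operations on it are recorded, not computed."""

    def __init__(self, name):
        self.name = name

    def _emit(self, op, other):
        global _counter
        dst = Var('t%d' % _counter)
        _counter += 1
        bytecode.append([op, dst.name, self.name, other.name])
        return dst

    def __add__(self, other):
        return self._emit('add', other)

    def __xor__(self, other):
        return self._emit('xor', other)

    def __and__(self, other):
        return self._emit('and', other)


def new_input(name):
    bytecode.append(['input', name])
    return Var(name)


a, b, c = new_input('a'), new_input('b'), new_input('c')

# Ordinary Python (here a loop) runs to completion; only its trace is kept.
x = a
for _ in range(2):
    x = (x + b) ^ c
bytecode.append(['output', x.name])

for ins in bytecode:
    print(ins)
```

The Python loop leaves no trace in the recorded bytecode: its body simply appears twice. This is one reason why additional markup may be needed when the generated code should keep a loop.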
| 296 | + |
| 297 | +--- |
| 298 | + |
| 299 | +Full Python, example |
| 300 | +- a part of MD4 definition adapted right from RFC 1320: |
| 301 | +>>>> |
|  | +import re |
| 302 | +def make_round(func, code): |
| 303 | +    res = '' |
| 304 | +    func = re.sub('([abcdks])', r'{\1}', func)  # a -> {a}, etc.: make a format template |
| 305 | +    parts = re.compile(r'\[(.)(.)(.)(.)\s+(\d+)\s+(\d+)\]').findall(code) |
| 306 | +    for a, b, c, d, k, s in parts: |
| 307 | +        res += func.format(**vars()) + "\n"  # one line of code per [ABCD k s] entry |
| 308 | +    return res |
| 309 | + |
| 310 | +exec(make_round('a = rol((a + F(b, c, d) + X[k]), s)', |
| 311 | +''' [ABCD 0 3] [DABC 1 7] [CDAB 2 11] [BCDA 3 19] |
| 312 | + [ABCD 4 3] [DABC 5 7] [CDAB 6 11] [BCDA 7 19] |
| 313 | + [ABCD 8 3] [DABC 9 7] [CDAB 10 11] [BCDA 11 19] |
| 314 | + [ABCD 12 3] [DABC 13 7] [CDAB 14 11] [BCDA 15 19] |
| 315 | +''')) |
| 316 | +<<<< |
| 317 | + |
| 318 | +--- |
| 319 | + |
| 320 | +Conclusions |
| 321 | +- john-devkit demonstrates practical application of code generation |
| 322 | + approach |
| 323 | +- john-devkit is a real way to automate programmer's work at such |
| 324 | + scale |
| 325 | + |
| 326 | +--- |
| 327 | + |
| 328 | +Thank you! |
| 329 | +- Thank you! |
| 330 | +- code: https://github.com/AlekseyCherepanov/john-devkit |
| 331 | +- more technical details will be posted on the john-dev mailing list |
| 332 | + |