What you live or die by when doing this is the speed of your crypt() implementation. The default ones are often horribly slow. If you're only doing 4000 crypts per second, you'll have to wait a long time for those codes. Tripper+ uses the UFC crypt() implementation, and it pushes about 64000 crypts per second per gigahertz on an Athlon processor. I've used some assembly-optimized crypt() code from John the Ripper to get up to 110000 crypts per second per gigahertz.
I don't know how hard it is to interface those from Python, though. I see there's a Perl wrapper for the UFC code, so maybe somebody made one for Python, too.
I'm not too familiar with what the "bitslice" and "normal" code is. What I did was take the version of the DES code that generates results that are padded and out-of-order, and transformed those back to the normal format in base64, and used that to make a small utility that searched for specific substrings and output them. Even with the overhead of the conversion, it was nearly twice the speed of any other DES code. If I was trying to crack specific tripcodes, I could do the conversion in the other direction just once and make it even faster, but I didn't find that very interesting.
So, wild discovery, and using the john code in my own program.
>>126
ah, interesting.
that sounds very much like you used the bitslice code, the output is 64bit binary there, does not include the salt, and it looked like converting it back would be possible.
"bitslicing" is class III vooodoo that treats one N-bit CPU like N 1-bit CPUs. john has a (already somewhat optimized) "normal" DES implemtation (that generates 128bit output with the salt mangeled in), and a even more insane bitslice one that does 32 or 64 crypt()-like operations "in parallel".
some stats, 1GHz athlon, custom framework (based on 4brute.c) doing targeted discovery on about 4k targets, kcps == kilo crypt()-equivalent-operations per second, including comparison, and measured against process user-time (not wallclock. so its equivalent to johns "virtual" rate.):
'John 1.6.38 / 64/64 BS MMX' => 279kcps (this is the "bitslice" version)
'OpenSSL 0.9.7e 25 Oct 2004' => 71kcps
'John 1.6.38 / 48/64 4K MMX' => 64kcps (this is the "normal" one)
'UFC-crypt 1c, 1.9 03/02/92' => 37kcps
'glibc2.3.4 slackware crypt' => 30kcps
this session taught me some things i didnt really expect:
1) forget ufc-crypt unless you have prehistoric hardware.
2) openssl fcrypt() is a very good starting point.
3) at 279kcps, the framework spends 10-15% of the cputime on generating the next input plaintext and comparing output against targets. and that is after some serious optimization in those areas. was more like 60% before that.
for the bitslice code to be really useful i had to run it on N sets with the same salt at once, which in case of the odd "salt derived from plaintext" tripcodes made me buffer up candidate-plaintexts mapping to the same salt, only activating the real backend when a salt-bank is full.
in case of john-bitslice the banksize is 64 (didnt have much choice there), for all other backends i was using 128 (thats the "max keys per crypt" johns non-bitslice des code was using).
(exception to the "only fire full banks" is the "endgame" phase when the plaintext-generator says "i am done" and there are partially full banks to purge, but that is pretty much irrelevant for the total performance on a run of several Mcrypt().)
as a sidenote, openssl performance seem to be the same even with not-sorted-by-salt use, but i guess it just initializes by salt every time, even if you pass it the same one again on next call.
Searching through: 0123456789abcdef
Number of users scanned: 4129
...
Exiting after 4581298448 crypt, 35442 inner, 71582993 salts, 22 found
'John 1.6.38 / 64/64 BS MMX' => Real: 26626s, 172kcps
'John 1.6.38 / 64/64 BS MMX' => Virt: 18416s, 248kcps
that was on a duron900 that does a lot of other things all the time.
4.3 Gcrypt run through the bitslice code, changing salts 68M times.
35442 fcrypt() run through openssl in response to the john-crypt giving a 29bit confidence match, 22 actually were full matches.
35420 openssl fcrypt() "wasted" sounds like much, until you realize its (even assuming a worst-case with some swapping) still less than 2 seconds total on a 7hour+ run, so i dont give a damn.
68M salt-changes also sounds like much, but profiling said its responsible for less than 3% of the cpuload, so again i dont give a damn.
so, i couldnt help myself and wasted even more time on it.
here some more highly unrepresentative stats of different systems, all using the john-bitslice code:
Penti-M 1600MHz 1024kB 526kcps 328cpspMHz
Duron 895MHz 64kB 248kcps 277cpspMHz
Athlon 1009MHz 256kB 279kcps 276cpspMHz
Pentium 233MHz 52kcps 223cpspMHz
Penti-4 2000MHz 512kB 406kcps 203cpspMHz
Celeron 2400MHz 256kB 436kcps 181cpspMHz
3rd column is second level cache size.
4th is raw "kilo-crypt() per second", virtual.
5th is "crypt() per second per MHz"
the insanely high score of the Pentium M, and the insanely low ones for the Pentium4 and P4Celeron indicate that version of the code was working on data-sets between 512kB and 1MB.
the Duron with the tiny cache scoring second seemed odd, but if i am trashing through the whole cache anyways, second-most-important feature is ram speed, and that machine is using dual-port DDR.
the pentium 233 using johnbs pushing almost twice the crypt()s the athlon1000 did with UFC amused me. unless i was using some kind of "wrong (version of) UFC", it seems that optimizations done 13+ years ago do not consider modern hardware. shocking.
(so unless you are running a cracking cluster with 15 year old RISC machines, dont bother with UFC)
did some improvements in the comparison framework to reduce the activity-relevant ramsize, in particular some lookup tables where i traded ram for speed earlier, got me another 10% or so, depending on the cachesize. trying this on the 512kB machine is still pending, this will be most interesting, i dont think i got it small enough for the 256kB cache to be enough. (thats where i am seeing 5-10% improvement, depending on how many targets i feed it.)
regarding dictionary generators, i found there is little need to write any, since john has a -stdout option. very handy.
even without any wordlists, just using character-probability/distribution lists and making up words, this is far more effective than the raw bruteforce tries 4brute implements.
the benchmarking-list is about 6800 tripcodes grabbed from various boards, and using "john -i -stdout | 4brute-johnbs -i", 1GHz of cpu find solutions for more than 1000 of those within a minute.
but to not just be a smelly-creepy cracker, i did some work in the direction of wild-discovery ("searching for cute tripcodes"), but without reversing the output of johns code. example, i want to find something beginning with a certain 4-5 char prefix, like /^waha\W/i ... which is just 32 different prefixes after all, upper-and-lowercase combinations for the four letters plus . or / behind it. at the same time, those 5 chars are 30bit worth of target, 5bit (32 strings) of which are considered "hits". so, i mongled the matching code in a way that allows it to just match prefixes, and handed it a list of those 32 possible prefixes i might like as targets. plus "random starting point of 8 char length" as input.
assuming an even distribution, the 30-5 bit suggest i should find a solution about every 33.5 million attempts. here output from a run of ~350 Million tries:
4brute using 'John 1.6.38 / 64/64 BS MMX'
Starting search from "sG+ym_8u"
Limit set to: 350000000
Searching through (randomised): ucWZXdw(P]Eb2gC_I7[/Q3MFKS$G`N}yql0jTm!Yz +UJa#5r89eh*D-:Ais)fOHtBRpvL.1x{k6V4no
Number of users scanned: 32
"WahA.DY8KI" == Gx "3Gxwm_8u" for WahA.aaaaa
"wAha.RNmpE" == .. "_}!Qu_8u" for wAha.aaaaa
"WaHA/D5Jq." == cY "j]Y0u_8u" for WaHA/aaaaa
"wAHA/GeRU." == R8 "NR8vu_8u" for wAHA/aaaaa
"WahA.7NzhI" == 9d "w9d{u_8u" for WahA.aaaaa
"WaHA/GE/Ng" == MU "hMU3c_8u" for WaHA/aaaaa
"WAhA.S/aOs" == bO "AbOac_8u" for WAhA.aaaaa
"wAHa/HPJco" == GC "KGC8c_8u" for wAHa/aaaaa
"WAhA.EyH8c" == l2 "ll2jX_8u" for WAhA.aaaaa
"WAhA/5Fmuw" == hC "+hCWd_8u" for WAhA/aaaaa
"Waha.D1AzY" == .M "G-MWd_8u" for Waha.aaaaa
"WAhA.2piT6" == K9 "LK9_d_8u" for WAhA.aaaaa
"WaHa/0qPHI" == xl "hxl7w_8u" for WaHa/aaaaa
"wAHA.jhgAw" == .x "x-xF(_8u" for wAHA.aaaaa
Exiting after 350059520 crypt, 14 inner, 4267664 salts, 14 found
'John 1.6.38 / 64/64 BS MMX' => Real: 1513s, 231kcps
'John 1.6.38 / 64/64 BS MMX' => Virt: 1308s, 267kcps
so, 25 minutes for finding 14 solutions. the 14 here is above average, i had other runs that coughed up only 8 hits in 350M tries. this run was on my trusty duron900 again, and as the difference between the "real" and "virtual" time indicates, 4brute only got about 85% of the cpu during the time.
>>169
interesting. can i get the code you used somewhere?
i just did some benchmarking again too, not using the sodomized 4brute as framework, but the "speeds.c" that comes with ufc. runs the crypt() for 10sec cpu time, reports number of done cycles. no saltchanges at all. it has defines for ufc and libc, i added a ossl variant.
Results from my trusty Duron900 (64kB cache):
code here: http://4dist.yi.org/misc/ufc-crypt/
the speeds.c and Makefile are modified to add ossl.
this agrees with the numbers i had for 4brute using different backends.
perhaps i am using somethings that calls itself ufc, but really is outdated, wrong, or so?
from patchlevel.h: "UFC-crypt, patchlevel 1c, @(#)patchlevel.h 1.9 03/02/92"
from the README (most people reading this probably dont know any of those except the 386. ;):
but, the README contains also a possible reason for the sucky performance i am seeing:
165 > 64. uh-oh.
let me guess, the Penti-M has 1MB cache?
goes to find something with bigger cache to rerun benches
athlon1000, 256kB
pentiumII 266, 512kB, 2.4.31 SMP
pentiumII 266, 512kB, 2.6.12.1 SMP
(note that the first P2 run only confirms that the linux 2.4.x SMP scheduler sucks frozen goats through nanotubes, with 2.6.x its right back in the pattern seen on the other hosts)
so, it does not seem to be just the size of the cpu cache.
makes me more interested in seeing the code used in >>169 ...