for the bitslice code to be really useful i had to run it on N sets with the same salt at once, which in case of the odd "salt derived from plaintext" tripcodes made me buffer up candidate-plaintexts mapping to the same salt, only activating the real backend when a salt-bank is full.
in case of john-bitslice the bank size is 64 (didn't have much choice there); for all other backends i was using 128 (that's the "max keys per crypt" john's non-bitslice des code was using).
(exception to the "only fire full banks" is the "endgame" phase when the plaintext-generator says "i am done" and there are partially full banks to purge, but that is pretty much irrelevant for the total performance on a run of several Mcrypt().)
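a minimal python sketch of the salt-bank buffering described above. this is not the actual code: the names SaltBanks, salt_of and crypt_bank are made up for illustration, and the salt rule used in the demo (characters 2 and 3 of the plaintext padded with "H.") is a simplified stand-in for the real tripcode salt derivation.

```python
from collections import defaultdict


class SaltBanks:
    """Buffer candidate plaintexts per salt; call the backend only on full banks."""

    def __init__(self, salt_of, crypt_bank, bank_size=64):
        self.salt_of = salt_of        # hypothetical callback: plaintext -> salt
        self.crypt_bank = crypt_bank  # hypothetical backend: (salt, [plaintexts])
        self.bank_size = bank_size    # 64 for john-bitslice, 128 for the others
        self.banks = defaultdict(list)

    def add(self, plaintext):
        """Queue one candidate; fire the backend if its salt bank just filled up."""
        salt = self.salt_of(plaintext)
        bank = self.banks[salt]
        bank.append(plaintext)
        if len(bank) == self.bank_size:
            del self.banks[salt]
            self.crypt_bank(salt, bank)

    def flush(self):
        """Endgame: the generator is done, so purge the partially full banks."""
        for salt, bank in list(self.banks.items()):
            if bank:
                self.crypt_bank(salt, bank)
        self.banks.clear()


def demo():
    """Feed 10 candidates over 2 salts into banks of 4 and record what fires."""
    fired = []
    banks = SaltBanks(
        salt_of=lambda p: (p + "H.")[1:3],               # simplified assumption
        crypt_bank=lambda salt, bank: fired.append((salt, len(bank))),
        bank_size=4,
    )
    for i in range(10):
        salt = "ab" if i % 2 == 0 else "cd"
        banks.add("x" + salt + str(i))
    banks.flush()
    return fired


if __name__ == "__main__":
    print(demo())  # [('ab', 4), ('cd', 4), ('ab', 1), ('cd', 1)]
```

the two full banks fire as soon as they fill up; the two leftover one-entry banks only go out in flush(), which is the "endgame" case.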
as a sidenote, openssl performance seems to be the same even when the calls are not sorted by salt, but i guess it just re-initializes for the salt every time, even if you pass it the same one again on the next call.
Searching through: 0123456789abcdef
Number of users scanned: 4129
...
Exiting after 4581298448 crypt, 35442 inner, 71582993 salts, 22 found
'John 1.6.38 / 64/64 BS MMX' => Real: 26626s, 172kcps
'John 1.6.38 / 64/64 BS MMX' => Virt: 18416s, 248kcps
that was on a duron900 that does a lot of other things all the time.
4.3 Gcrypt (4581298448, counting G as 2^30) run through the bitslice code, changing salts 68M times (71582993, counting M as 2^20).
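the numbers in the log can be cross-checked with a few lines of python. this is just arithmetic on the figures printed above; the variable names are made up, and reading "crypts per salt change" as a measure of how full the banks were is an interpretation, not something the log states.

```python
CRYPTS = 4581298448       # "Exiting after ... crypt"
SALT_CHANGES = 71582993   # "... salts"
REAL_SECONDS = 26626      # wall-clock time
VIRT_SECONDS = 18416      # cpu time actually spent in the process


def kcps(crypts, seconds):
    """Throughput in thousands of crypts per second."""
    return crypts / seconds / 1000


real_kcps = kcps(CRYPTS, REAL_SECONDS)    # about 172
virt_kcps = kcps(CRYPTS, VIRT_SECONDS)    # about 248
crypts_per_salt = CRYPTS / SALT_CHANGES   # about 64.0

if __name__ == "__main__":
    print("real: %.0f kcps, virt: %.0f kcps" % (real_kcps, virt_kcps))
    print("crypts per salt change: %.2f" % crypts_per_salt)
```

an average of almost exactly 64 crypts per salt change is consistent with nearly every bank being fired full at the john-bitslice bank size of 64.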
35442 fcrypt() calls were run through openssl in response to the john-crypt giving a 29-bit confidence match; 22 of them actually were full matches.
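a rough python sketch of that two-stage check, plus an estimate of how many 29-bit false alarms to expect. assumptions, none of them stated in the post: the filter compares the low 29 bits (which 29 bits are really used is unknown), every crypt result is compared against all 4129 targets, and the filter bits behave like uniform random bits. all function names are hypothetical.

```python
CRYPTS = 4581298448   # from the log
TARGETS = 4129        # "Number of users scanned"
FILTER_BITS = 29      # width of the cheap first-stage match

# assumption: low 29 bits; the post does not say which bits are compared
MASK = (1 << FILTER_BITS) - 1


def partial_match(candidate_bits, target_bits):
    """Cheap first stage: compare only the masked 29 bits."""
    return (candidate_bits & MASK) == (target_bits & MASK)


def check(candidate_bits, target_bits, full_check):
    """Second stage runs only when the cheap filter hits.

    full_check is a hypothetical callable standing in for the expensive
    full comparison (the openssl fcrypt() call in the post).
    """
    if not partial_match(candidate_bits, target_bits):
        return False
    return full_check()


def expected_false_positives(crypts, targets, bits):
    """Expected number of chance hits on a uniform filter of the given width."""
    return crypts * targets / 2 ** bits


if __name__ == "__main__":
    estimate = expected_false_positives(CRYPTS, TARGETS, FILTER_BITS)
    print("expected 29-bit false alarms: about %.0f" % estimate)
```

under those assumptions the estimate comes out a bit above 35000, in the same ballpark as the 35420 non-matching calls actually seen.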
35420 "wasted" openssl fcrypt() calls sounds like much, until you realize it's (even assuming a worst case with some swapping) still less than 2 seconds total on a 7+ hour run, so i don't give a damn.
68M salt changes also sounds like much, but profiling said they're responsible for less than 3% of the cpu load, so again i don't give a damn.