so, i couldnt help myself and wasted even more time on it.
here some more highly unrepresentative stats of different systems, all using the john-bitslice code:
Penti-M 1600MHz 1024kB 526kcps 328cpspMHz
Duron 895MHz 64kB 248kcps 277cpspMHz
Athlon 1009MHz 256kB 279kcps 276cpspMHz
Pentium 233MHz 52kcps 223cpspMHz
Penti-4 2000MHz 512kB 406kcps 203cpspMHz
Celeron 2400MHz 256kB 436kcps 181cpspMHz
3rd column is second level cache size.
4th is raw "kilo-crypt() per second", virtual.
5th is "crypt() per second per MHz"
the insanely high score of the Pentium M, and the insanely low ones for the Pentium4 and P4Celeron indicate that version of the code was working on data-sets between 512kB and 1MB.
the Duron with the tiny cache scoring second seemed odd, but if i am trashing through the whole cache anyways, second-most-important feature is ram speed, and that machine is using dual-port DDR.
the pentium 233 using johnbs pushing almost twice the crypt()s the athlon1000 did with UFC amused me. unless i was using some kind of "wrong (version of) UFC", it seems that optimizations done 13+ years ago do not consider modern hardware. shocking.
(so unless you are running a cracking cluster with 15 year old RISC machines, dont bother with UFC)
did some improvements in the comparison framework to reduce the activity-relevant ramsize, in particular some lookup tables where i traded ram for speed earlier, got me another 10% or so, depending on the cachesize. trying this on the 512kB machine is still pending, this will be most interesting, i dont think i got it small enough for the 256kB cache to be enough. (thats where i am seeing 5-10% improvement, depending on how many targets i feed it.)