- -------
- FFdecsa
- -------
-
- Compiling is as easy as running a make command, if you have gcc and are
- using a little endian machine. 64 bit machines have not been tested but
- may work with little or no changes; big endian machines will certainly
- give incorrect results (read the technical_background.txt to know where
- the problem is).
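If you are unsure about the byte order of your machine, a small check like the following can tell you before you build. This is only a sketch; the function name host_is_little_endian is made up for illustration and is not part of FFdecsa.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of FFdecsa):
 * returns 1 on a little endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];

    /* Copy the in-memory representation of the probe value. */
    memcpy(bytes, &probe, sizeof probe);

    /* A little endian machine stores the least significant byte first. */
    return bytes[0] == 0x04;
}
```

A result of 0 means the machine is not little endian, which is the case where incorrect results should be expected.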
-
- Before compiling you could edit the Makefile to tweak compiler flags for
- optimal performance. If you want to play with different bit-grouping
- strategies you have to edit FFdecsa_DBG.c and change the "our choice"
- definition. This is highly critical for performance.
-
- After compilation run the FFdecsa_test application. It will test correct
- decryption and print the measured speed (use "nice --19 ./FFdecsa_test"
- on an idle machine for better results). Or just use "make test".
-
- gcc >=3.3.3 is highly recommended. Older versions could give performance
- problems.
-
- icc is currently unusable. In the initial phases of development of
- FFdecsa icc was able to compile the code and gave interesting speed
- results when using the 8charA grouping mode (arrays of 8 characters are
- automatically manipulated through MMX instructions). At some point the
- code began to work incorrectly because of a compiler bug (but I found a
- workaround). Then, the performance dropped for no apparent reason; I found a
- workaround by adding an unused variable (alignment problem, grep for icc
- in the code to see where it happens). Then, with the introduction of
- group modes based on intrinsics, gcc was finally able to go beyond the
- speed record originally set by icc. Additional code tweaks added more
- speed to gcc, while icc started to segfault on compilation (both versions
- 7 and 8). In conclusion, icc is buggy and this code is too hard for it.
- gcc on the other hand is great. I tried to inspect generated assembler
- to find weak spots, and the generated code is very good indeed.
-
- Note: the code can be compiled with gcc or g++. g++ is 3% faster for
- some reason.
-
- You should not get any errors or warnings. I only get two "inlining
- failed" warnings on two functions I asked to be inlined but gcc doesn't
- want to inline.
-
- The build process creates additional temp files by running grep
- commands. This is how debugging output is handled. All the lines
- containing DBG are removed and the temp file is compiled (so the line
- numbers change between temp and original files). Don't edit the temp
- files, they will be overwritten. If you don't remove the DBG lines (for
- example, by changing "grep -v DBG" into "grep -v aaDBG" in Makefile) a
- lot of output will be generated. This is useful to understand what's
- wrong when the FFdecsa_test is failing. I included a reference "known
- good" output in the debug_output directory. Extra debug output is
- commented out in the code.
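The effect of the "grep -v DBG" step can be pictured with a few lines of C. This is a rough sketch of the filtering idea only, not the build's actual mechanism (which is the grep command in the Makefile); the function name strip_tagged_lines and the sample input in the note below are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src to dst, dropping every line that contains
 * `tag`, roughly like "grep -v tag". dst must have room for at least
 * strlen(src) + 1 bytes. Returns the number of lines kept.
 */
size_t strip_tagged_lines(const char *src, const char *tag, char *dst)
{
    size_t kept = 0;
    size_t tag_len = strlen(tag);

    while (*src != '\0') {
        /* Length of the current line, including its newline if present. */
        const char *eol = strchr(src, '\n');
        size_t len = eol ? (size_t)(eol - src) + 1 : strlen(src);
        int tagged = 0;

        /* Look for the tag inside the current line only. */
        for (size_t i = 0; tag_len > 0 && i + tag_len <= len; i++) {
            if (memcmp(src + i, tag, tag_len) == 0) {
                tagged = 1;
                break;
            }
        }

        if (!tagged) {
            memcpy(dst, src, len);
            dst += len;
            kept++;
        }
        src += len;
    }
    *dst = '\0';
    return kept;
}
```

Filtering a three-line input whose middle line contains DBG with the tag "DBG" keeps two lines. Filtering with "aaDBG" keeps all three, which is why that substitution in the Makefile turns the debug output on. Because lines are dropped, line numbers in the temp file no longer match the original.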
-
- The debug output functionality could be... buggy. This is because I
- tested everything using hard coded int grouping mode and then
- generalized the debug output to abstract grouping modes. A bug where 4
- bytes are printed instead of 8 could be present somewhere. I think it
- isn't, but you've been warned.
-
- This code was only tried on Linux.
- It should work on Windows or other platforms, but you may encounter
- problems related to the compiler quality. If you want to try, begin with
- the int grouping mode. It is only 30% slower than the best (MMX) and it
- should be easily portable because no intrinsics are used. I'm
- particularly interested in hearing what kind of performance can be
- obtained on x86_64 processors in int, long long int, mmx, 2mmx, sse
- modes.
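To see why an int based grouping mode needs no intrinsics, here is a minimal sketch of the general idea of packing 32 independent bits into one unsigned int and operating on all of them with ordinary C operators. It is an illustration under assumptions, not FFdecsa's actual code: the names group32, group_xor, group_and and gather_bit are invented here, and the real definitions live in the FFdecsa sources.

```c
#include <stdint.h>

/* Hypothetical type: one bit from each of 32 independent data streams. */
typedef uint32_t group32;

/* A plain C operator acts on all 32 lanes at once, no intrinsics needed. */
group32 group_xor(group32 a, group32 b) { return a ^ b; }
group32 group_and(group32 a, group32 b) { return a & b; }

/*
 * Hypothetical helper: collect bit number `bit` (0..7) of each of 32 bytes
 * into one group, so that lane i holds the bit taken from bytes[i].
 */
group32 gather_bit(const uint8_t bytes[32], unsigned bit)
{
    group32 g = 0;

    for (unsigned lane = 0; lane < 32; lane++)
        g |= (group32)((bytes[lane] >> bit) & 1u) << lane;
    return g;
}
```

Since only standard integer types and operators appear, code of this shape should compile on any platform with a C compiler. That is the sense in which the int mode is easy to port.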
-
-
- As a reference, here are the results I get on an Athlon XP 2400+ (this
- processor runs at 2000MHz); other processors belonging to the Athlon XP
- architecture, including Durons, should have the same speed per MHz.
- Cache size and bus speed don't matter.
-
- CPU: AMD Athlon XP 2400+
-
- Compiler: g++ (gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7))
-
- Flags: -O3 -march=athlon-xp -fexpensive-optimizations -funroll-loops
- --param max-unrolled-insns=500
-
- grouping mode         speed (Mbit/s)  notes
- ---------------------------------------------------------------------
- PARALLEL_32_4CHAR           14
- PARALLEL_32_4CHARA          12
- PARALLEL_32_INT            125        very good and very portable
- PARALLEL_64_8CHAR           17
- PARALLEL_64_8CHARA          15        needs a vectorizing compiler
- PARALLEL_64_2INT            75        x86 has too few registers
- PARALLEL_64_LONG            97        try this on x86_64
- PARALLEL_64_MMX            165        the best
- PARALLEL_128_16CHAR          6
- PARALLEL_128_16CHARA         7
- PARALLEL_128_4INT           69
- PARALLEL_128_2LONG          52
- PARALLEL_128_2MMX           36        slower than expected
- PARALLEL_128_SSE           156        just slower than 64_MMX
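
Since processors of the same architecture should have the same speed per MHz, the figures above can be scaled to another clock rate with simple arithmetic. The helper below is a sketch that assumes perfectly linear scaling with clock speed; the name scale_speed_mbit is made up for illustration.

```c
/*
 * Hypothetical helper: estimate throughput at target_mhz from a reference
 * measurement, assuming speed scales linearly with the clock rate.
 */
double scale_speed_mbit(double ref_mbit, double ref_mhz, double target_mhz)
{
    return ref_mbit / ref_mhz * target_mhz;
}
```

For example, the 165 Mbit/s measured for 64_MMX at 2000 MHz would correspond to about 82.5 Mbit/s on a 1000 MHz processor of the same family. Treat this as an estimate, not a measurement.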
-
- Best speeds are obtained with native data types: int, mmx, sse (this
- could be a compiler artifact).
-
- 64 bit processors should try 64_LONG.
-
- Vectorizing compilers should like *CHARA.
-
- 64_MMX is faster than 128_SSE on the Athlon; perhaps SSE instructions are
- internally split into 64 bit chunks. Could be different on x86_64 or
- Intel processors.
-
- 128_SSE has a 64 bit (MMX) batch type because SSE has no shifting
- instructions; they are only available in SSE2. As the Athlon XP doesn't
- support SSE2, I couldn't experiment with that.