------- FFdecsa -------

This doc is for people who looked into the source code and found it
difficult to believe that this is a decsa algorithm, as it appears
completely different from other decsa implementations.

It appears different because it is different. Being different is what
enables it to be a lot faster than all the others (currently it has
more than 800% the speed of the best version I was able to find).

The csa algo was designed to be run in hardware, but people are now
running it in software. Hardware has data lines carrying bits and
functional blocks doing calculations (logic operations, adders,
shifters, table lookup, ...); software instead uses memory to contain
data values and executes a sequence of instructions to transform the
values. As a consequence, writing a software implementation of a
hardware algorithm can be inefficient. For example, if you have 32
data lines, you can permute the bits with zero cost in hardware (you
just permute the physical traces), but if you have the bits in a 32
bit variable you have to use 32 "and" operations with 32 different
masks, 32 shifts and 31 "or" operations (and if you suggest using
"if"s testing the bits one by one, you know nothing about how branch
prediction works in modern processors).

So the approach is *emulating the hardware*. Then there are some
additional cool tricks.

TRICK NUMBER 0: emulate the hardware
------------------------------------

We will work on bits one by one, that is, a 4 bit word becomes four
variables. In this way complex software operations revert to simple
hardware emulation:

  software                         hardware
  ---------------------------------------------------------------
  copy values                      copy values
  logic op                         logic op
  ands+shifts+ors (bit permut.)    copy values
  additions                        logic op emulating adders
  if (comparisons)                 logic op selecting one of the
                                   two results
  lookup tables                    logic op synthesizing a ROM (*)

  (*) sometimes lookup tables can be converted to logic expressions

The sboxes in the stream cypher have been converted to efficient
logic operations using custom written software (look into the logic
directory); this is responsible for a lot of the speed increase.
Maybe there exists a slightly better way to express the sboxes as
logical expressions, but it would be a minuscule improvement.

The sbox in the block cypher can't be converted to efficient logic
operations (8 bits of input is just too much) and is implemented with
a traditional lookup in an array.

But there is a problem: we want to process bits, but our external
input and output want bytes. We need conversion routines. Conversion
routines are similar to the awful permutations we described before,
so this has to be done efficiently somehow (see trick number 6).

TRICK NUMBER 1: virtual shift registers
---------------------------------------

Shift registers are normally implemented by moving all the data
around. It is better to leave the data in the same memory locations
and redefine where the start of the register is (updating a
pointer). This is called a virtual shift register.
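For illustration, here is a minimal sketch of the idea (this is not
the actual FFdecsa code; vshift_reg, REG_SIZE and the data type are
made up). The register lives in a buffer twice as long, shifting is
just a pointer decrement, and the data is bulk-copied only when the
pointer hits the bottom, so the cost is amortized to almost nothing:

  #include <string.h>

  #define REG_SIZE 10                 /* made-up register length */

  typedef struct {
      unsigned int buf[2*REG_SIZE];   /* extra room so we can slide */
      unsigned int *start;            /* current start of the register */
  } vshift_reg;

  void vsr_init(vshift_reg *r){
      memset(r->buf,0,sizeof(r->buf));
      r->start=r->buf+REG_SIZE;
  }

  /* cell i of the register is simply r->start[i] */
  void vsr_shift_in(vshift_reg *r, unsigned int in){
      if(r->start==r->buf){
          /* out of room: one bulk copy every REG_SIZE shifts,
             instead of REG_SIZE copies at every shift */
          memcpy(r->buf+REG_SIZE+1, r->buf,
                 (REG_SIZE-1)*sizeof(unsigned int));
          r->start=r->buf+REG_SIZE+1;
      }
      r->start--;
      r->start[0]=in;
  }

Reading a cell is one indexed load, and a shift is one pointer
decrement plus one store, instead of REG_SIZE loads and stores.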
TRICK NUMBER 2: parallel bitslice
---------------------------------

Implementing the algorithm as described in tricks 0 and 1 gives us
about 15% of the speed of a traditional implementation. This happens
because we work on only one bit, even if our CPU is 32 bit wide. But
*we can process 32 different packets at the same time*. This is
called the "bitslice" method. It can be done only if the program flow
does not depend on the data (if, while, ...). Luckily this is true.

Things like

  if(a){
    b=c&d;
  }
  else{
    b=e&f;
  }

can be coded as (think of how hardware would implement this)

  b1=c&d;
  b2=e&f;
  b=b2^(a&(b1^b2));

and things like

  if(a){
    b=c&d;
  }

can be transformed in the same way, as they may be written as

  if(a){
    b=c&d;
  }
  else{
    b=b;
  }

It may look wasteful, but it is not, and it removes the dependency of
the program flow on the data. Our code takes the same time as before,
but produces 32 results, so speed is now 480% the speed of a
traditional implementation.

TRICK NUMBER 3: multimedia instructions
---------------------------------------

If our CPU is 32 bit but it can also process larger blocks of data
efficiently (multimedia instructions), we can use them. We only need
logic ops and these are typically available. We can use MMX and work
on 64 packets, or SSE and work on 128 packets. The speed doesn't
automatically double going from 32 to 64 because the integer
registers of the processor are normally faster, but some speed is
gained in this way.

Multimedia instructions are often used by writing assembler by hand,
but compilers are very good at register allocation, loop unrolling
and instruction scheduling, so it is better to write the code in C
and use native multimedia data types (intrinsics).

Depending on the number of available registers, execution latency and
number of execution units in the CPU, it may be good to process more
than one data block at the same time, for example two 64 bit MMX
values. In this case we work on 128 bits by simulating a 128 bit op
with two consecutive 64 bit ops. This may or may not help (apparently
not, because the x86 architecture has a small number of registers).
We can also try working on 96 bits, pairing an MMX and an int op, or
on 192 bits, using MMX and SSE together. While this is doable in
theory and could exploit different execution units in the CPU, speed
doesn't improve (because of cache line handling problems inside the
CPU, maybe).

Besides int, MMX and SSE, we can use long long int (64 bit) and, why
not, unsigned char. Using groups of unsigned chars (8 or 16) could
give the compiler an opportunity to insert multimedia instructions
automatically. For example, icc can use one MMX instruction to do

  unsigned char a[8],b[8],c[8];
  for(i=0;i<8;i++){
    a[i]=b[i]&c[i];
  }

Some compilers (like icc) are efficient in this case, but using
intrinsics manually is generally faster.

All these experiments can be done easily if the code is written in a
way which abstracts the data type used. This is not easy but doable:
all the operations on data become (inlined) function calls or
preprocessor macros. Good compilers are able to simplify all the
abstraction at compile time and generate perfect code (gcc is
great). The data abstraction used in the code is called "group".
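As a sketch of what the "group" abstraction boils down to (this is
not the actual FFdecsa code; group_and, group_xor and group_select
are invented names, and SSE2 intrinsics are used just as an example
of one possible instantiation):

  /* one logic op of the bitsliced algorithm, carried out on
     128 packets at once; bit i of a group belongs to packet i */
  #include <emmintrin.h>

  typedef __m128i group;

  static inline group group_and(group a, group b){
      return _mm_and_si128(a,b);
  }
  static inline group group_xor(group a, group b){
      return _mm_xor_si128(a,b);
  }

  /* the branchless select of trick number 2, for 128 packets:
     b = b2 ^ (a & (b1 ^ b2)) */
  static inline group group_select(group a, group b1, group b2){
      return group_xor(b2,group_and(a,group_xor(b1,b2)));
  }

Swapping the typedef and the inline functions for int, long long or
MMX versions changes the parallelism without touching the algorithm,
which is exactly why the abstraction pays off.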
TRICK NUMBER 4: parallel byteslice
----------------------------------

The bitslice method works wonderfully on the stream cypher, but can't
be applied to the block cypher because of the evil big lookup
table. As we have to convert input data from normal to bitslice
before starting processing and from bitslice to normal before output,
we convert the stream cypher output to normal before the block
calculations and do the block stage in a traditional way.

There are some xors in the block cypher, so we arrange bytes from
different packets side by side and use multimedia instructions to
work on many bytes at the same time. This is not exactly bitslice;
maybe it should be called byteslice. The conversion routines are
similar (just a bit simpler). The data type we use to do this in the
code is called "batch". The virtual shift register described in trick
number 1 is useful here too.

The lookup table is the only thing which is done serially, one byte
at a time. Luckily, if we do it on 32 or 64 bytes the loop is heavily
unrolled, and the compiler and the CPU manage to get a good speed
because there is little dependency between instructions.
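A sketch of the byteslice idea (again not the real code; batch_xor
and the exact layout are invented, and SSE2 intrinsics are just one
possible choice): byte k of 16 different packets is stored
contiguously, so one multimedia instruction processes byte k of all
16 packets at once.

  #include <emmintrin.h>

  typedef __m128i batch;   /* lane i of a batch belongs to packet i */

  /* one xor stage of the block cypher, applied to 16 packets at a
     time; batch k holds byte k of all the packets */
  void batch_xor(batch *dst, const batch *a, const batch *b, int nbytes){
      int k;
      for(k=0;k<nbytes;k++){
          dst[k]=_mm_xor_si128(a[k],b[k]);
      }
  }

Only the sbox lookup still has to pick the bytes out one by one;
everything else in the block stage runs across the whole batch.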
TRICK NUMBER 5: efficient bit permutation
-----------------------------------------

The block cypher has a bit permutation part. As we are not in
bitsliced form at that point, permuting the bits in a byte takes 8
masks, 8 and, 7 or; but three bits move in the same direction, so we
can make it with 6 masks, 6 and, 5 or. Batch processing through
multimedia instructions is applicable here too.

TRICK NUMBER 6: efficient normal<->slice conversion
---------------------------------------------------

The bitslice<->normal conversion routines are a sort of transposition
operation: you have bits in rows and want them in columns. This can
be done efficiently. For example, transposition of 8 bytes (a matrix
of 8x8=64 bits) can be done this way (we want to exchange bit[i][j]
with bit[j][i] and we assume bit 0 is the MSB in the byte):

  // untested code, may be bugged
  unsigned char a[8];
  unsigned char b[8];
  for(i=0;i<8;i++) b[i]=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++){
      b[i]|=((a[j]>>(7-i))&1)<<(7-j);
    }
  }

but it is slow (128 shifts, 64 and, 64 or), or

  // untested code, may be bugged
  unsigned char a[8];
  unsigned char b[8];
  for(i=0;i<8;i++) b[i]=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++){
      if(a[j]&(1<<(7-i))) b[i]|=1<<(7-j);
    }
  }

but it is very very slow (128 shifts, 64 and, 64 or, 64 unpredictable
if!), or using a>>=1 and b<<=1, which gains you nothing, or

  // untested code, may be bugged
  unsigned char a[8];
  unsigned char b[8];
  unsigned char top,bottom;
  for(j=0;j<1;j++){
    for(i=0;i<4;i++){
      top=   a[8*j+i];
      bottom=a[8*j+4+i];
      a[8*j+i]  = (top&0xf0)     |((bottom&0xf0)>>4);
      a[8*j+4+i]=((top&0x0f)<<4) | (bottom&0x0f);
    }
  }
  for(j=0;j<2;j++){
    for(i=0;i<2;i++){
      top=   a[4*j+i];
      bottom=a[4*j+2+i];
      a[4*j+i]  = (top&0xcc)     |((bottom&0xcc)>>2);
      a[4*j+2+i]=((top&0x33)<<2) | (bottom&0x33);
    }
  }
  for(j=0;j<4;j++){
    for(i=0;i<1;i++){
      top=   a[2*j+i];
      bottom=a[2*j+1+i];
      a[2*j+i]  = (top&0xaa)     |((bottom&0xaa)>>1);
      a[2*j+1+i]=((top&0x55)<<1) | (bottom&0x55);
    }
  }
  for(i=0;i<8;i++) b[i]=a[i]; // easy to integrate into one of the stages above

which is very fast (24 shifts, 48 and, 24 or) and has redundant loops
and address calculations which will be optimized away by the
compiler. It can be written as 3 nested loops, but it becomes less
readable and makes it difficult to have the results in b without an
extra copy. The compiler always unrolls heavily. The gain is much
bigger when operating on 32 bit or 64 bit values (we are going from
N^2 to N*log(N)). This method is used for rectangular matrices too
(they have to be seen as square matrices side by side).

Warning: this code is not *endian independent* if you use ints to
work on 4 bytes. Running it on a big endian processor will give you a
different and strange kind of bit rotation if you don't modify masks
and shifts.

This is done in the code using int or long long int. It should be
possible to use MMX instead of long long int and it could be faster,
but this code doesn't cost a great fraction of the total time. There
are problems with the shifts, as multimedia instructions do not have
all the kinds of shift we need (SSE has none!).

TRICK NUMBER 7: try hard to process packets together
----------------------------------------------------

As we are able to process many packets together, we have to avoid
running with many slots empty. Processing one packet or 64 packets
takes the same time if the internal parallelism is 64! So we try hard
to aggregate packets that can be processed together; for simplicity
reasons we don't mix packets with even and odd parity (different
keys), even if it should be doable with a little effort.

Sometimes the transition from even to odd parity and vice versa is
not sharp, and there are sequences like EEEEEOEEOEEOOOO. We try to
group all the E together even if there are O between them. This
out-of-order processing complicates the interface to the applications
a bit, but saves us three or four runs with many empty slots.

We also have logic to process together packets with different sizes
of the payload, which is not always 184 bytes. This involves sorting
the packets by size before processing and careful operation of the 23
iteration loop (a full 184 byte payload is 23 blocks of 8 bytes) to
exclude some packets from the calculations. It is not CPU heavy.

Packets with payload <8 bytes are identical before and after
decryption (!), so we skip them without using a slot (according to
DVB specs these kinds of packets shouldn't happen, but they are used
in the real world).

TRICK NUMBER 8: try to avoid doing the same thing many times
------------------------------------------------------------

Some calculations related to keys are done only when the keys are
set; then all the values depending on the keys are stored in a
convenient form and used every time we convert a group of packets.

TRICK NUMBER 9: compiler
------------------------

Compilers have a lot of optimization options. I used -march to target
my CPU and played with unusual options. In particular,
"--param max-unrolled-insns=500" does a good job on the tricky table
lookup in the block cypher. Bigger values unroll too much somewhere
and lose speed.

All the testing has been done on an AthlonXP CPU with a specific
version of gcc

  gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7)

Other combinations of CPU and compiler can give different speeds. If
the compiler is not able to simplify the group and batch structures
and stores everything in memory instead of registers, performance
will be low. Absolutely use a good compiler!

Note: the same code can be compiled in C or C++ mode. g++ gives a 3%
speed increase compared to gcc (I suppose some stricter constraints
on arrays and pointers in C++ mode give the optimizer more freedom).

TRICK NUMBER a: a lot of brain work
-----------------------------------

The code started as a very slow but correct implementation and was
then tweaked for months, with a lot of experimentation, adding all
the good ideas one after another to achieve little steps toward the
best speed possible, while continuously testing that nothing had been
broken. Many hours were spent on this code.

Enjoy the result.