-------
FFdecsa
-------
This doc is for people who looked into the source code and found it
difficult to believe that this is a decsa algorithm, as it appears
completely different from other decsa implementations.

It appears different because it is different. Being different is what
enables it to be a lot faster than all the others (currently it has
more than 800% the speed of the best version I was able to find).

The csa algo was designed to be run in hardware, but people are now
running it in software.
Hardware has data lines carrying bits and functional blocks doing
calculations (logic operations, adders, shifters, table lookups, ...);
software instead uses memory to contain data values and executes a
sequence of instructions to transform the values. As a consequence,
writing a software implementation of a hardware algorithm can be
inefficient.

For example, if you have 32 data lines, you can permute the bits at
zero cost in hardware (you just reroute the physical traces), but if
you have the bits in a 32 bit variable you have to use 32 "and"
operations with 32 different masks, 32 shifts and 31 "or" operations
(if you suggest using "if"s testing the bits one by one, you know
nothing about how branch prediction works in modern processors).
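As a concrete illustration, here is a minimal sketch of that cost in C
(the permutation itself is made up; only the per-bit mask/shift/or
pattern matters):

  #include <stdint.h>

  /* Hypothetical permutation: bit i of x moves to position (7*i)%32.
     Every bit costs shifts, an "and" and an "or"; hardware gets the
     same rerouting for free. */
  uint32_t permute32(uint32_t x)
  {
      uint32_t r = 0;
      int i;
      for (i = 0; i < 32; i++)
          r |= ((x >> i) & 1) << ((7 * i) & 31);
      return r;
  }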
So the approach is *emulating the hardware*.

Then there are some additional cool tricks.
TRICK NUMBER 0: emulate the hardware
------------------------------------
We will work on bits one by one, that is, a 4 bit word becomes four
variables. In this way we turn complex software operations back into
simple hardware emulation:

  software                         hardware
  -----------------------------------------------------------------
  copy values                      copy values
  logic ops                        logic ops
  ands+shifts+ors (bit permut.)    copy values
  additions                        logic ops emulating adders
  ifs (comparisons)                logic ops selecting one of two results
  lookup tables                    logic ops synthesizing a ROM (*)

(*) sometimes lookup tables can be converted to logic expressions
The sbox in the stream cypher has been converted to efficient logic
operations using custom written software (look into the logic
directory) and is responsible for a lot of the speed increase. Maybe
there exists a slightly better way to express the sbox as logical
expressions, but it would be a minuscule improvement. The sbox in the
block cypher can't be converted to efficient logic operations (8 bits
of input are just too much) and is implemented with a traditional
lookup in an array.
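To make the "logic ops emulating adders" row of the table above
concrete, here is a minimal sketch of a bitsliced full adder (the
group type and the names are illustrative, not the ones used in the
code):

  /* Each variable holds one bit (or, with trick number 2 below, the
     same bit of many packets).  sum and carry need only logic ops. */
  typedef unsigned int group;

  static void full_add(group a, group b, group cin,
                       group *sum, group *cout)
  {
      group t = a ^ b;
      *sum  = t ^ cin;
      *cout = (a & b) | (t & cin);
  }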
But there is a problem: we want to process bits, but our external
input and output want bytes. We need conversion routines. Conversion
routines are similar to the awful permutations we described before, so
this has to be done efficiently somehow.
TRICK NUMBER 1: virtual shift registers
---------------------------------------

Shift registers are normally implemented by moving all the data
around. It is better to leave the data in the same memory locations
and redefine where the start of the register is (by updating a
pointer). That is called a virtual shift register.
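A minimal sketch of the idea, with invented names (the real code
generalizes this):

  #define REG_BITS   32   /* visible length of the shift register */
  #define MAX_SHIFTS 64   /* how many clocks one run needs at most */

  typedef struct {
      unsigned int cell[REG_BITS + MAX_SHIFTS];
      int start;          /* current origin: cell[start] is bit 0 */
  } vshiftreg;

  /* Clocking writes one new cell and moves the origin; no existing
     cell is ever copied around. */
  static void vsr_clock(vshiftreg *r, unsigned int newbit)
  {
      r->cell[r->start + REG_BITS] = newbit;
      r->start++;
  }

  static unsigned int vsr_bit(const vshiftreg *r, int i)
  {
      return r->cell[r->start + i];
  }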
TRICK NUMBER 2: parallel bitslice
---------------------------------
Implementing the algorithm as described in tricks 0 and 1 gives us
about 15% of the speed of a traditional implementation. This happens
because we work on only one bit, even if our CPU is 32 bits wide. But
*we can process 32 different packets at the same time*. This is called
the "bitslice" method. It can be done only if the program flow is not
dependent on the data (if, while, ...). Luckily this is true here.
Things like

  if(a){
    b=c&d;
  }
  else{
    b=e&f;
  }

can be coded as (think of how hardware would implement this)

  b1=c&d;
  b2=e&f;
  b=b2^(a&(b1^b2));

and things like

  if(a){
    b=c&d;
  }

can be transformed in the same way, as they may be written as

  if(a){
    b=c&d;
  }
  else{
    b=b;
  }
It may look wasteful, but it is not, and it destroys the data
dependency. Our code takes the same time as before but produces 32
results, so speed is now 480% the speed of a traditional
implementation.
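Putting the layout and the branchless select together, a minimal
sketch (names are illustrative):

  typedef unsigned int group;  /* one word = 32 packets in parallel */

  /* state[k] holds bit k of the cipher state for 32 packets: bit p
     of state[k] belongs to packet p, so one logic op on a group
     advances all 32 packets at once. */
  group state[64];

  /* the branchless if/else above, as a reusable helper: where a
     packet's sel bit is 1 take t, where it is 0 take f */
  static group mux(group sel, group t, group f)
  {
      return f ^ (sel & (t ^ f));
  }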
TRICK NUMBER 3: multimedia instructions
---------------------------------------
If our CPU is 32 bit but it can also process larger blocks of data
efficiently (multimedia instructions), we can use them. We only need
logic ops and these are typically available.

We can use MMX and work on 64 packets, or SSE and work on 128 packets.
The speed doesn't automatically double going from 32 to 64, because
the integer registers of the processor are normally faster. However,
some speed is gained in this way.

Multimedia instructions are often used by writing assembler by hand,
but compilers are very good at register allocation, loop unrolling and
instruction scheduling, so it is better to write the code in C and use
native multimedia data types (intrinsics).

Depending on the number of available registers, execution latency and
the number of execution units in the CPU, it may be good to process
more than one data block at the same time, for example two 64 bit MMX
values. In this case we work on 128 bits by simulating a 128 bit op
with two consecutive 64 bit ops. This may or may not help (apparently
not, because the x86 architecture has a small number of registers).

We can also try working on 96 bits, pairing an MMX and an int op, or
on 192 bits by using MMX and SSE. While this is doable in theory and
could exploit different execution units in the CPU, speed doesn't
improve (because of cache line handling problems inside the CPU,
maybe).

Besides int, MMX and SSE, we can use long long int (64 bit) and, why
not, unsigned char.
Using groups of unsigned chars (8 or 16) could give the compiler an
opportunity to insert multimedia instructions automatically. For
example, icc can use one MMX instruction to do

  unsigned char a[8],b[8],c[8];
  int i;
  for(i=0;i<8;i++){
    a[i]=b[i]&c[i];
  }

Some compilers (like icc) are efficient in this case, but using
intrinsics manually is generally faster.
All these experiments can be done easily if the code is written in a
way which abstracts the data type used. This is not easy, but it is
doable: all the operations on data become (inlined) function calls or
preprocessor macros. Good compilers are able to simplify away all the
abstraction at compile time and generate perfect code (gcc is great).

The data abstraction used in the code is called "group".
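A minimal sketch of what such an abstraction can look like, here
instantiated for SSE2 intrinsics (the macro and type names are
invented; the real "group" in the code differs):

  #include <emmintrin.h>               /* SSE2 intrinsics */

  typedef __m128i group;               /* 128 packets in parallel */
  #define GROUP_AND(a,b) _mm_and_si128((a),(b))
  #define GROUP_OR(a,b)  _mm_or_si128((a),(b))
  #define GROUP_XOR(a,b) _mm_xor_si128((a),(b))

  /* rebuilding with
       typedef unsigned int group;
       #define GROUP_AND(a,b) ((a)&(b))   ...
     switches the whole cipher back to 32 packets without touching
     the algorithm code */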
TRICK NUMBER 4: parallel byteslice
----------------------------------
The bitslice method works wonderfully on the stream cypher, but it
can't be applied to the block cypher because of the evil big look up
table. As we have to convert input data from normal to bitslice before
starting processing and from bitslice to normal before output, we
convert the stream cypher output to normal before the block
calculations and do the block stage in a traditional way.

There are some xors in the block cypher, so we arrange bytes from
different packets side by side and use multimedia instructions to work
on many bytes at the same time. This is not exactly bitslice; maybe it
should be called byteslice. The conversion routines are similar (just
a bit simpler).

The data type we use to do this in the code is called "batch".

The virtual shift register described in trick number 1 is useful here
too.

The look up table is the only thing which is done serially, one byte
at a time. Luckily, if we do it on 32 or 64 bytes, the loop is heavily
unrolled and the compiler and the CPU manage to get good speed,
because there is little dependency between instructions.
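A minimal sketch of the byteslice layout (the type and the round
structure are invented for illustration):

  #include <stdint.h>

  typedef uint64_t batch;  /* byte b of 8 different packets, side by side */

  /* xor a whole batch in one op; the sbox lookup is the only serial,
     per-byte part, but over 8 bytes it unrolls with few dependencies */
  static void byteslice_step(batch *x, batch y,
                             const unsigned char sbox[256])
  {
      unsigned char *q = (unsigned char *)x;
      int p;
      *x ^= y;                      /* 8 packets at once */
      for (p = 0; p < 8; p++)
          q[p] = sbox[q[p]];        /* unavoidable serial lookups */
  }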
TRICK NUMBER 5: efficient bit permutation
-----------------------------------------
The block cypher has a bit permutation part. As we are not in a
bitsliced form at that point, permuting the bits in a byte takes 8
masks, 8 ands, 7 ors; but three bits move in the same direction, so we
can make it with 6 masks, 6 ands, 5 ors. Batch processing through
multimedia instructions is applicable here too.
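A sketch of the grouping idea on a made-up permutation (the real CSA
permutation and the exact 6/6/5 counts differ; the point is that bits
moving by the same distance share one mask, one shift and one or):

  /* invented permutation: bits 0,3,5 move up by 1, bits 1,4,6 move
     down by 1, bits 2,7 stay; grouping equal distances gives 3 masks,
     2 shifts, 2 ors instead of one mask+shift+or per bit */
  static unsigned char permute_bits(unsigned char x)
  {
      return ((x & 0x29) << 1)   /* bits 0,3,5 -> 1,4,6 */
           | ((x & 0x52) >> 1)   /* bits 1,4,6 -> 0,3,5 */
           |  (x & 0x84);        /* bits 2,7 unchanged  */
  }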
TRICK NUMBER 6: efficient normal<->slice conversion
---------------------------------------------------
The bitslice<->normal conversion routines are a sort of transposition
operation: you have bits in rows and want them in columns. This can be
done efficiently. For example, the transposition of 8 bytes (a matrix
of 8x8=64 bits) can be done this way (we want to exchange bit[i][j]
with bit[j][i] and we assume bit 0 is the MSB in the byte):

  // untested code, may be bugged
  unsigned char a[8];
  unsigned char b[8];
  int i,j;
  for(i=0;i<8;i++) b[i]=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++){
      b[i]|=((a[j]>>(7-i))&1)<<(7-j);
    }
  }
but it is slow (128 shifts, 64 ands, 64 ors), or

  // untested code, may be bugged
  unsigned char a[8];
  unsigned char b[8];
  int i,j;
  for(i=0;i<8;i++) b[i]=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++){
      if(a[j]&(1<<(7-i))) b[i]|=1<<(7-j);
    }
  }

but this is very, very slow (128 shifts, 64 ands, 64 ors, 128
unpredictable ifs!), or using a>>=1 and b<<=1, which gains you
nothing, or
  184. // untested code, may be bugged
  185. unsigned char a[8];
  186. unsigned char b[8];
  187. unsigned char top,bottom;
  188. for(j=0;j<1;j++){
  189. for(i=0;i<4;i++){
  190. top= a[8*j+i];
  191. bottom=a[8*j+4+i];
  192. a[8*j+i]= (top&0xf0) |((bottom&0xf0)>>4);
  193. a[8*j+4+i]=((top&0x0f)<<4)| (bottom&0x0f);
  194. }
  195. }
  196. for(j=0;j<2;j++){
  197. for(i=0;i<2;i++){
  198. top= a[4*j+i];
  199. bottom=a[4*j+2+i];
  200. a[4*j+i] = (top&0xcc) |((bottom&0xcc)>>2);
  201. a[4*j+2+i]=((top&0x33)<<2)| (bottom&0x33);
  202. }
  203. }
  204. for(j=0;j<4;j++){
  205. for(i=0;i<1;i++){
  206. top= a[2*j+i];
  207. bottom=a[2*j+1+i];
  208. a[2*j+i] = (top&0xaa) |((bottom&0xaa)>>1);
  209. a[2*j+1+i]=((top&0x55)<<1)| (bottom&0x55);
  210. }
  211. }
  212. for(i=0;i<8;i++) b[i]=a[i]; //easy to integrate into one of the stages above
  213. which is very fast (24 shifts, 48 and, 24 or) and has redundant loops
  214. and address calculations which will be optimized away by the compiler.
  215. It can be written as 3 nested loops but it becomes less readable and
  216. makes it difficult to have results in b without an extra copy. The
  217. compiler always unrolls heavily.
The gain is much bigger when operating on 32 bit or 64 bit values (we
are going from N^2 to N*log(N)). This method is used for rectangular
matrices too (they have to be seen as square matrices side by side).

Warning: this code is not *endian independent* if you use ints to work
on 4 bytes. Running it on a big endian processor will give you a
different and strange kind of bit rotation if you don't modify the
masks and shifts.

This is done in the code using int or long long int. It should be
possible to use MMX instead of long long int and it could be faster,
but this code doesn't cost a great fraction of the total time. There
are problems with the shifts, as multimedia instructions do not have
all the kinds of shift we need (SSE has none!).
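For reference, the same butterfly on a single 64-bit word, in the
delta-swap form (a standard formulation, shown here as an untested
sketch; rows are the bytes of x from the most significant down, and
the MSB of each byte is column 0):

  #include <stdint.h>

  /* transpose an 8x8 bit matrix packed into a 64-bit word: three
     stages of "swap the bits selected by the mask with the bits s
     positions higher" replace the naive 64 per-bit moves */
  static uint64_t transpose8x8(uint64_t x)
  {
      uint64_t t;
      t = (x ^ (x >> 7))  & 0x00AA00AA00AA00AAULL;  x ^= t ^ (t << 7);
      t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL;  x ^= t ^ (t << 14);
      t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL;  x ^= t ^ (t << 28);
      return x;
  }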
TRICK NUMBER 7: try hard to process packets together
----------------------------------------------------
As we are able to process many packets together, we have to avoid
running with many slots empty. Processing one packet or 64 packets
takes the same time if the internal parallelism is 64! So we try hard
to aggregate packets that can be processed together; for simplicity we
don't mix packets with even and odd parity (different keys), even if
that should be doable with a little effort. Sometimes the transition
from even to odd parity and vice versa is not sharp, and there are
sequences like EEEEEOEEOEEOOOO. We try to group all the Es together
even if there are Os between them. This out-of-order processing
complicates the interface to the applications a bit, but it saves us
three or four runs with many empty slots.
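A minimal sketch of such a grouping pass (the function and layout are
invented; the scrambling control bit in the TS header is real):

  /* gather up to `slots` packets of one parity from a window of TS
     packets, skipping packets of the other parity so they can form
     the next batch; bit 6 of header byte 3 tells even from odd key */
  static int collect_run(unsigned char *pkt[], int n, int parity,
                         unsigned char *slot[], int slots)
  {
      int i, used = 0;
      for (i = 0; i < n && used < slots; i++) {
          if (((pkt[i][3] >> 6) & 1) == parity)
              slot[used++] = pkt[i];   /* decrypted later as one batch */
      }
      return used;
  }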
We also have logic to process together packets with different payload
sizes, as the payload is not always 184 bytes. This involves sorting
the packets by size before processing and careful operation of the
23-iteration loop to exclude some packets from the calculations. It is
not CPU heavy.

Packets with a payload of less than 8 bytes are identical before and
after decryption (!), so we skip them without using a slot. (According
to the DVB specs these kinds of packets shouldn't happen, but they are
used in the real world.)
TRICK NUMBER 8: try to avoid doing the same thing many times
------------------------------------------------------------
Some calculations related to the keys are done only when the keys are
set; then all the values depending on the keys are stored in a
convenient form and used every time we convert a group of packets.
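A minimal sketch of that caching (the structure and names are
invented):

  /* everything derivable from a control word is computed once, when
     the key is set, not once per group of packets */
  typedef struct {
      unsigned char sched[56];  /* e.g. an expanded block cypher key schedule */
      int valid;
  } keyset;

  static keyset even_key, odd_key;

  /* hypothetical key expansion, run only when a new control word arrives */
  extern void expand_key(const unsigned char cw[8], unsigned char sched[56]);

  static void set_key(keyset *k, const unsigned char cw[8])
  {
      expand_key(cw, k->sched);
      k->valid = 1;
  }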
TRICK NUMBER 9: compiler
------------------------
Compilers have a lot of optimization options. I used -march to target
my CPU and played with unusual options. In particular

  --param max-unrolled-insns=500

does a good job on the tricky table lookup in the block cypher. Bigger
values unroll too much somewhere and lose speed. All the testing has
been done on an AthlonXP CPU with a specific version of gcc:

  gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7)
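For reference, a build line in this spirit might look like this (the
file name is illustrative):

  gcc -O3 -march=athlon-xp --param max-unrolled-insns=500 -c FFdecsa.c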
Other combinations of CPU and compiler can give different speeds. If
the compiler is not able to simplify the group and batch structures
and stores everything in memory instead of registers, performance will
be low.

Absolutely use a good compiler!

Note: the same code can be compiled in C or C++ mode. g++ gives a 3%
speed increase compared to gcc (I suppose some stricter constraints on
arrays and pointers in C++ mode give the optimizer more freedom).
TRICK NUMBER a: a lot of brain work
-----------------------------------
The code started as a very slow but correct implementation and was
then tweaked for months, with a lot of experimentation and by adding
all the good ideas one after another, to achieve little steps toward
the best speed possible while continuously testing that nothing had
been broken. Many hours were spent on this code.

Enjoy the result.