Add FFdecsa in FFdecsa/ directory.

The code is licensed under GPLv2+ and was extracted from the getstream a84 build.
Georgi Chorbadzhiyski 12 years ago
commit 423579774e

+ 339
- 0
FFdecsa/COPYING

@@ -0,0 +1,339 @@
+		    GNU GENERAL PUBLIC LICENSE
+		       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+                          675 Mass Ave, Cambridge, MA 02139, USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+			    Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+		    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License.  (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code.  (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it.  For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable.  However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Program or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded.  In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation.  If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission.  For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this.  Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+			    NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+		     END OF TERMS AND CONDITIONS
+
+	Appendix: How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) 19yy  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) 19yy name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs.  If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library.  If this is what you want to do, use the GNU Library General
+Public License instead of this License.

+ 206
- 0
FFdecsa/ChangeLog

@@ -0,0 +1,206 @@
+- created
+
+- released 0.0.1
+
+- simplified s, A, B
+
+- released 0.0.2
+
+- simplified nxt=
+
+- released 0.0.3
+
+- removed commented code
+- code formatting
+
+- released 0.0.4
+
+- kk now unsigned char
+- removed 64 bit ints
+
+- released 0.0.5
+
+- created decrypt_2ts
+
+- released 0.0.6
+
+- renamed files
+- created decrypt_many_ts, removed others
+- external interface has 2 functions only: set_cws() and decrypt_many_ts()
+- reformatted code
+- reimplemented s12,s34,s56,s7
+- unsigned char become int for table optimization
+
+- released 0.0.7
+
+- optional icc compiler
+- kk now 0..55
+- decrypt_many_ts really works (no parallelism yet)
+- added get_cws() to interface
+- created stream.c
+- created key_schedule_stream, using iA[] and iB[]
+
+- released 0.0.8
+
+- decrypt_many_ts() makes a group, sorts the packets, processes them
+- preliminar stream_cypher_group() created
+- parallel computing activated
+- huge speed increase (+500%) thanks to stream_cypher_group()
+
+- released 0.0.9
+
+- block_cypher_group() created (no parallelism yet)
+
+- released 0.0.10
+
+- block_cypher_group() has 56 simple iterations
+- block_cypher_group() doesn't shift registers anymore
+
+- released 0.0.11
+
+- some parallelization on block_cypher_group()
+
+- released 0.0.12
+
+- better parallelization of block_cypher_group()
+
+- released 0.0.13
+
+- block_cypher() was still called by error when N=23
+- speed is now 109Mbit/s on AMD XP2000+ CPU
+
+- released 0.0.14
+
+- stream_cypher_group() has a init and normal variant
+- A[0]-A[9] instead of A[1]-A[10], same for B
+- implemented virtual shift of A and B
+- speed is now 117Mbit/s on AMD XP2000+ CPU
+
+- released 0.0.15
+
+- better optimization of E and F in the stream cypher
+- speed is now 119Mbit/s on AMD XP2000+ CPU
+
+- released 0.0.16
+
+- removed some debug overhead
+- speed is now 120Mbit/s on AMD XP2000+ CPU
+
+- released 0.0.17
+
+- don't move packets with residue anymore
+- speed is now 123Mbit/s on AMD XP2000+ CPU
+
+- released 0.0.18
+
+- solved alignment problems
+- search groupable packets even beyond ungroupable ones
+  (more speed in some real world cases)
+- created decrypt_many_ts2(), useful with circular buffers
+
+- released 0.0.19
+
+- removed old code
+
+- released 0.0.20
+
+- partially converted code to size-independent group
+- icc doesn't work with optimizations on
+
+- released 0.1.1
+
+- merge loops on block_decypher (speed++ gcc, speed-- icc)
+- transposition are now functions (speed-- icc)
+- icc works again (compiler bug work around?)
+
+- released 0.1.2
+
+- better use of COPY8 &co
+- better flags for gcc
+- removed old code
+
+- released 0.1.3
+
+- int and not char in block cypher (speed++++++ gcc, speed-- icc)
+
+- released 0.1.4
+
+- group abstraction finally implemented
+- support for group width 64
+
+- released 0.1.5
+
+- group 64 mmx implemented (speed++ gcc)
+
+- released 0.1.6
+
+- more parallelism in block cypher (speed++ gcc)
+- transposition before and after block (disabled because of no speed gain yet)
+
+- released 0.1.7
+
+- more parallelism in block cypher (speed++ gcc)
+- transposition before and after block enabled (speed++ gcc)
+- gcc options (unrolled 500) speed gcc++
+
+- released 0.1.8
+
+- reworked FFN_ALL_* constants (speed++++ gcc)
+
+- released 0.1.9
+
+- transposition in block as inlined functions
+- group abstraction working well
+
+- released 0.1.10
+
+- group 128 sse implemented, but batch is 64 mmx (not faster than group 64 mmx)
+
+- released 0.1.11
+
+- lot of code polishing and dead code elimination
+- better and more debug output
+
+- released 0.1.12
+
+- name change: FFdecsa
+
+- released 0.2.0
+
+- separated test cases
+- corrected all group_modes (now called parallel_modes)
+- parallel 128 8 char implemented
+- parallel 64 long implemented
+- parallel 128 2 long implemented
+- parallel 128 2 mmx implemented (incredibly slow, the compiler is very confused)
+- parallel 128 16 charA implemented (very slow compilation)
+- parallel 128 16 char implemented
+- renamed softcsa* to FFdecsa*
+
+- released 0.2.1
+
+- new external interface (based on ranges)
+
+- released 0.2.2
+
+- can be compiled with g++ too
+- using g++ the code is 3% faster!
+- external interface: function name changing and new functions
+- a group of ranges is now called a cluster
+- renamed autogenerated files
+
+- released 0.2.3
+
+- written docs
+- removed unneeded files
+- added Copyright and license notes
+- reworked "logic"
+
+- released 0.3.0
+
+- Makefile reworked
+- misc fixes
+- added vdr patch
+
+- released 1.0.0 (public release)
+

+ 880
- 0
FFdecsa/FFdecsa.c

@@ -0,0 +1,880 @@
+/* FFdecsa -- fast decsa algorithm
+ *
+ * Copyright (C) 2003-2004  fatih89r
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+
+#include <sys/types.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "FFdecsa.h"
+
+#ifndef NULL
+#define NULL 0
+#endif
+
+//#define DEBUG
+#ifdef DEBUG
+#define DBG(a) a
+#else
+#define DBG(a)
+#endif
+
+//// parallelization stuff, large speed differences are possible
+// possible choices
+#define PARALLEL_32_4CHAR     320
+#define PARALLEL_32_4CHARA    321
+#define PARALLEL_32_INT       322
+#define PARALLEL_64_8CHAR     640
+#define PARALLEL_64_8CHARA    641
+#define PARALLEL_64_2INT      642
+#define PARALLEL_64_LONG      643
+#define PARALLEL_64_MMX       644
+#define PARALLEL_128_16CHAR  1280
+#define PARALLEL_128_16CHARA 1281
+#define PARALLEL_128_4INT    1282
+#define PARALLEL_128_2LONG   1283
+#define PARALLEL_128_2MMX    1284
+#define PARALLEL_128_SSE     1285
+#define PARALLEL_128_SSE2    1286
+
+//////// our choice //////////////// our choice //////////////// our choice //////////////// our choice ////////
+#ifndef PARALLEL_MODE
+#define PARALLEL_MODE PARALLEL_32_INT
+#endif
+//////// our choice //////////////// our choice //////////////// our choice //////////////// our choice ////////
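Note on the `#ifndef PARALLEL_MODE` guard above: the default (`PARALLEL_32_INT`) only applies when the build has not already defined the macro. The following is a minimal sketch of that pattern, reduced to two of the mode constants. The helper `selected_mode()` is a hypothetical name used for illustration and is not part of FFdecsa.

```c
#include <assert.h>

/* Two of the mode constants from FFdecsa.c, copied here for illustration. */
#define PARALLEL_32_INT   322
#define PARALLEL_64_LONG  643

/* Same guard as in FFdecsa.c: a definition supplied by the build
   (for example through a compiler -D option) takes precedence. */
#ifndef PARALLEL_MODE
#define PARALLEL_MODE PARALLEL_32_INT
#endif

/* Hypothetical helper: reports which mode was compiled in. */
static int selected_mode(void)
{
    int mode = PARALLEL_MODE;
    /* This sketch only knows the two constants defined above. */
    assert(mode == PARALLEL_32_INT || mode == PARALLEL_64_LONG);
    return mode;
}
```

Compiled without any extra definition, `selected_mode()` returns 322. Passing something like `-DPARALLEL_MODE=PARALLEL_64_LONG` to gcc should select the other constant, because the name is only expanded where `PARALLEL_MODE` is used, after the constants have been defined.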
+
+#include "parallel_generic.h"
+//// conditionals
+#if PARALLEL_MODE==PARALLEL_32_4CHAR
+#include "parallel_032_4char.h"
+#elif PARALLEL_MODE==PARALLEL_32_4CHARA
+#include "parallel_032_4charA.h"
+#elif PARALLEL_MODE==PARALLEL_32_INT
+#include "parallel_032_int.h"
+#elif PARALLEL_MODE==PARALLEL_64_8CHAR
+#include "parallel_064_8char.h"
+#elif PARALLEL_MODE==PARALLEL_64_8CHARA
+#include "parallel_064_8charA.h"
+#elif PARALLEL_MODE==PARALLEL_64_2INT
+#include "parallel_064_2int.h"
+#elif PARALLEL_MODE==PARALLEL_64_LONG
+#include "parallel_064_long.h"
+#elif PARALLEL_MODE==PARALLEL_64_MMX
+#include "parallel_064_mmx.h"
+#elif PARALLEL_MODE==PARALLEL_128_16CHAR
+#include "parallel_128_16char.h"
+#elif PARALLEL_MODE==PARALLEL_128_16CHARA
+#include "parallel_128_16charA.h"
+#elif PARALLEL_MODE==PARALLEL_128_4INT
+#include "parallel_128_4int.h"
+#elif PARALLEL_MODE==PARALLEL_128_2LONG
+#include "parallel_128_2long.h"
+#elif PARALLEL_MODE==PARALLEL_128_2MMX
+#include "parallel_128_2mmx.h"
+#elif PARALLEL_MODE==PARALLEL_128_SSE
+#include "parallel_128_sse.h"
+#elif PARALLEL_MODE==PARALLEL_128_SSE2
+#include "parallel_128_sse2.h"
+#else
+#error "unknown/undefined parallel mode"
+#endif
+
+// stuff depending on conditionals
+
+#define BYTES_PER_GROUP (GROUP_PARALLELISM/8)
+#define BYPG BYTES_PER_GROUP
+#define BITS_PER_GROUP GROUP_PARALLELISM
+#define BIPG BITS_PER_GROUP
+
+#ifndef MALLOC
+#define MALLOC(X) malloc(X)
+#endif
+#ifndef FREE
+#define FREE(X) free(X)
+#endif
+#ifndef MEMALIGN
+#define MEMALIGN
+#endif
+
+//// debug tool
+
+#ifdef DEBUG
+static void dump_mem(const char *string, const unsigned char *p, int len, int linelen){
+  int i;
+  for(i=0;i<len;i++){
+    if(i%linelen==0&&i) fprintf(stderr,"\n");
+    if(i%linelen==0) fprintf(stderr,"%s %08x:",string,i);
+    else{
+      if(i%8==0) fprintf(stderr," ");
+      if(i%4==0) fprintf(stderr," ");
+    }
+    fprintf(stderr," %02x",p[i]);
+  }
+  if(i%linelen==0) fprintf(stderr,"\n");
+}
+#endif
+
+//////////////////////////////////////////////////////////////////////////////////
+
+struct csa_key_t{
+	unsigned char ck[8];
+// used by stream
+        int iA[8];  // iA[0] is for A1, iA[7] is for A8
+        int iB[8];  // iB[0] is for B1, iB[7] is for B8
+// used by stream (group)
+        MEMALIGN group ck_g[8][8]; // [byte][bit:0=LSB,7=MSB]
+        MEMALIGN group iA_g[8][4]; // [0 for A1][0 for LSB]
+        MEMALIGN group iB_g[8][4]; // [0 for B1][0 for LSB]
+// used by block
+	unsigned char kk[56];
+// used by block (group)
+	MEMALIGN batch kkmulti[56]; // many times the same byte in every batch
+};
+
+struct csa_keys_t{
+  struct csa_key_t even;
+  struct csa_key_t odd;
+};
+
+//-----stream cypher
+
+//-----key schedule for stream decypher
+static void key_schedule_stream(
+  unsigned char *ck,    // [In]  ck[0]-ck[7]   8 bytes   | Key.
+  int *iA,              // [Out] iA[0]-iA[7]   8 nibbles | Key schedule.
+  int *iB)              // [Out] iB[0]-iB[7]   8 nibbles | Key schedule.
+{
+    iA[0]=(ck[0]>>4)&0xf;
+    iA[1]=(ck[0]   )&0xf;
+    iA[2]=(ck[1]>>4)&0xf;
+    iA[3]=(ck[1]   )&0xf;
+    iA[4]=(ck[2]>>4)&0xf;
+    iA[5]=(ck[2]   )&0xf;
+    iA[6]=(ck[3]>>4)&0xf;
+    iA[7]=(ck[3]   )&0xf;
+    iB[0]=(ck[4]>>4)&0xf;
+    iB[1]=(ck[4]   )&0xf;
+    iB[2]=(ck[5]>>4)&0xf;
+    iB[3]=(ck[5]   )&0xf;
+    iB[4]=(ck[6]>>4)&0xf;
+    iB[5]=(ck[6]   )&0xf;
+    iB[6]=(ck[7]>>4)&0xf;
+    iB[7]=(ck[7]   )&0xf;
+}
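Note on `key_schedule_stream()` above: it splits the 8-byte key into 16 nibbles, high nibble first, with bytes 0 to 3 filling `iA` and bytes 4 to 7 filling `iB`. The same mapping can be sketched as a loop. This is an illustrative rewrite under that reading of the code, not the committed implementation, and `split_cw_nibbles` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: split an 8-byte key into 16 nibbles.
   Bytes 0..3 fill iA, bytes 4..7 fill iB, high nibble first.
   split_cw_nibbles is a hypothetical name, not an FFdecsa function. */
static void split_cw_nibbles(const unsigned char ck[8], int iA[8], int iB[8])
{
    int i;

    assert(ck != NULL && iA != NULL && iB != NULL);

    for (i = 0; i < 4; i++) {
        iA[2 * i]     = (ck[i] >> 4) & 0xf;      /* high nibble of byte i   */
        iA[2 * i + 1] =  ck[i]       & 0xf;      /* low nibble of byte i    */
        iB[2 * i]     = (ck[i + 4] >> 4) & 0xf;  /* high nibble of byte i+4 */
        iB[2 * i + 1] =  ck[i + 4]       & 0xf;  /* low nibble of byte i+4  */
    }
}
```

For a key starting with the byte `0x12`, this gives `iA[0] == 1` and `iA[1] == 2`, matching the unrolled assignments in the diff.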
+
+//----- stream main function
+
+#define STREAM_INIT
+#include "stream.c"
+#undef STREAM_INIT
+
+#define STREAM_NORMAL
+#include "stream.c"
+#undef STREAM_NORMAL
+
+
+//-----block decypher
+
+//-----key schedule for block decypher
+
+static void key_schedule_block(
+  unsigned char *ck,    // [In]  ck[0]-ck[7]   8 bytes | Key.
+  unsigned char *kk)    // [Out] kk[0]-kk[55] 56 bytes | Key schedule.
+{
+  static const unsigned char key_perm[0x40] = {
+    0x12,0x24,0x09,0x07,0x2A,0x31,0x1D,0x15, 0x1C,0x36,0x3E,0x32,0x13,0x21,0x3B,0x40,
+    0x18,0x14,0x25,0x27,0x02,0x35,0x1B,0x01, 0x22,0x04,0x0D,0x0E,0x39,0x28,0x1A,0x29,
+    0x33,0x23,0x34,0x0C,0x16,0x30,0x1E,0x3A, 0x2D,0x1F,0x08,0x19,0x17,0x2F,0x3D,0x11,
+    0x3C,0x05,0x38,0x2B,0x0B,0x06,0x0A,0x2C, 0x20,0x3F,0x2E,0x0F,0x03,0x26,0x10,0x37,
+  };
+
+  int i,j,k;
+  int bit[64];
+  int newbit[64];
+  int kb[7][8];
+
+  // 56 steps
+  // 56 key bytes kk(55)..kk(0) by key schedule from ck
+
+  // kb(6,0) .. kb(6,7) = ck(0) .. ck(7)
+  kb[6][0] = ck[0];
+  kb[6][1] = ck[1];
+  kb[6][2] = ck[2];
+  kb[6][3] = ck[3];
+  kb[6][4] = ck[4];
+  kb[6][5] = ck[5];
+  kb[6][6] = ck[6];
+  kb[6][7] = ck[7];
+
+  // calculate kb[5] .. kb[0]
+  for(i=5; i>=0; i--){
+    // 64 bit perm on kb
+    for(j=0; j<8; j++){
+      for(k=0; k<8; k++){
+        bit[j*8+k] = (kb[i+1][j] >> (7-k)) & 1;
+        newbit[key_perm[j*8+k]-1] = bit[j*8+k];
+      }
+    }
+    for(j=0; j<8; j++){
+      kb[i][j] = 0;
+      for(k=0; k<8; k++){
+        kb[i][j] |= newbit[j*8+k] << (7-k);
+      }
+    }
+  }
+
+  // xor to give kk
+  for(i=0; i<7; i++){
+    for(j=0; j<8; j++){
+      kk[i*8+j] = kb[i][j] ^ i;
+    }
+  }
+
+}
251
+
252
+//-----block utils
253
+
254
+static inline __attribute__((always_inline)) void trasp_N_8 (unsigned char *in,unsigned char* out,int count){
255
+  int *ri=(int *)in;
256
+  int *ibi=(int *)out;
257
+  int j,i,k,g;
258
+  // copy and first step
259
+  for(g=0;g<count;g++){
260
+    ri[g]=ibi[2*g];
261
+    ri[GROUP_PARALLELISM+g]=ibi[2*g+1];
262
+  }
263
+//dump_mem("NE1 r[roff]",&r[roff],GROUP_PARALLELISM*8,GROUP_PARALLELISM);
264
+// now 01230123
265
+#define INTS_PER_ROW (GROUP_PARALLELISM/8*2)
266
+  for(j=0;j<8;j+=4){
267
+    for(i=0;i<2;i++){
268
+      for(k=0;k<INTS_PER_ROW;k++){
269
+        unsigned int t,b;
270
+        t=ri[INTS_PER_ROW*(j+i)+k];
271
+        b=ri[INTS_PER_ROW*(j+i+2)+k];
272
+        ri[INTS_PER_ROW*(j+i)+k]=     (t&0x0000ffff)      | ((b           )<<16);
273
+        ri[INTS_PER_ROW*(j+i+2)+k]=  ((t           )>>16) |  (b&0xffff0000) ;
274
+      }
275
+    }
276
+  }
277
+//dump_mem("NE2 r[roff]",&r[roff],GROUP_PARALLELISM*8,GROUP_PARALLELISM);
278
+// now 01010101
279
+  for(j=0;j<8;j+=2){
280
+    for(i=0;i<1;i++){
281
+      for(k=0;k<INTS_PER_ROW;k++){
282
+        unsigned int t,b;
283
+        t=ri[INTS_PER_ROW*(j+i)+k];
284
+        b=ri[INTS_PER_ROW*(j+i+1)+k];
285
+        ri[INTS_PER_ROW*(j+i)+k]=     (t&0x00ff00ff)     | ((b&0x00ff00ff)<<8);
286
+        ri[INTS_PER_ROW*(j+i+1)+k]=  ((t&0xff00ff00)>>8) |  (b&0xff00ff00);
287
+      }
288
+    }
289
+  }
290
+//dump_mem("NE3 r[roff]",&r[roff],GROUP_PARALLELISM*8,GROUP_PARALLELISM);
291
+// now 00000000
292
+}
293
+
294
+static inline __attribute__((always_inline)) void trasp_8_N (unsigned char *in,unsigned char* out,int count){
295
+  int *ri=(int *)in;
296
+  int *bdi=(int *)out;
297
+  int j,i,k,g;
298
+#define INTS_PER_ROW (GROUP_PARALLELISM/8*2)
299
+//dump_mem("NE1 r[roff]",&r[roff],GROUP_PARALLELISM*8,GROUP_PARALLELISM);
300
+// now 00000000
301
+  for(j=0;j<8;j+=2){
302
+    for(i=0;i<1;i++){
303
+      for(k=0;k<INTS_PER_ROW;k++){
304
+        unsigned int t,b;
305
+        t=ri[INTS_PER_ROW*(j+i)+k];
306
+        b=ri[INTS_PER_ROW*(j+i+1)+k];
307
+        ri[INTS_PER_ROW*(j+i)+k]=     (t&0x00ff00ff)     | ((b&0x00ff00ff)<<8);
308
+        ri[INTS_PER_ROW*(j+i+1)+k]=  ((t&0xff00ff00)>>8) |  (b&0xff00ff00);
309
+      }
310
+    }
311
+  }
312
+//dump_mem("NE2 r[roff]",&r[roff],GROUP_PARALLELISM*8,GROUP_PARALLELISM);
313
+// now 01010101
314
+  for(j=0;j<8;j+=4){
315
+    for(i=0;i<2;i++){
316
+      for(k=0;k<INTS_PER_ROW;k++){
317
+        unsigned int t,b;
318
+        t=ri[INTS_PER_ROW*(j+i)+k];
319
+        b=ri[INTS_PER_ROW*(j+i+2)+k];
320
+        ri[INTS_PER_ROW*(j+i)+k]=     (t&0x0000ffff)      | ((b           )<<16);
321
+        ri[INTS_PER_ROW*(j+i+2)+k]=  ((t           )>>16) |  (b&0xffff0000) ;
322
+      }
323
+    }
324
+  }
325
+//dump_mem("NE3 r[roff]",&r[roff],GROUP_PARALLELISM*8,GROUP_PARALLELISM);
326
+// now 01230123
327
+  for(g=0;g<count;g++){
328
+    bdi[2*g]=ri[g];
329
+    bdi[2*g+1]=ri[GROUP_PARALLELISM+g];
330
+  }
331
+}
332
+
333
+//-----block main function
334
+
335
+// block group
336
+static void block_decypher_group(
337
+  batch *kkmulti,       // [In]  kkmulti[0]-kkmulti[55] 56 batches | Key schedule (each batch has repeated equal bytes).
338
+  unsigned char *ib,    // [In]  (ib0,ib1,...ib7)...x32 32*8 bytes | Initialization vector.
339
+  unsigned char *bd,    // [Out] (bd0,bd1,...bd7)...x32 32*8 bytes | Block decipher.
340
+  int count)
341
+{
342
+  // int was expected to be faster than unsigned char; apparently it is not
343
+  static const unsigned char block_sbox[0x100] = {
344
+    0x3A,0xEA,0x68,0xFE,0x33,0xE9,0x88,0x1A, 0x83,0xCF,0xE1,0x7F,0xBA,0xE2,0x38,0x12,
345
+    0xE8,0x27,0x61,0x95,0x0C,0x36,0xE5,0x70, 0xA2,0x06,0x82,0x7C,0x17,0xA3,0x26,0x49,
346
+    0xBE,0x7A,0x6D,0x47,0xC1,0x51,0x8F,0xF3, 0xCC,0x5B,0x67,0xBD,0xCD,0x18,0x08,0xC9,
347
+    0xFF,0x69,0xEF,0x03,0x4E,0x48,0x4A,0x84, 0x3F,0xB4,0x10,0x04,0xDC,0xF5,0x5C,0xC6,
348
+    0x16,0xAB,0xAC,0x4C,0xF1,0x6A,0x2F,0x3C, 0x3B,0xD4,0xD5,0x94,0xD0,0xC4,0x63,0x62,
349
+    0x71,0xA1,0xF9,0x4F,0x2E,0xAA,0xC5,0x56, 0xE3,0x39,0x93,0xCE,0x65,0x64,0xE4,0x58,
350
+    0x6C,0x19,0x42,0x79,0xDD,0xEE,0x96,0xF6, 0x8A,0xEC,0x1E,0x85,0x53,0x45,0xDE,0xBB,
351
+    0x7E,0x0A,0x9A,0x13,0x2A,0x9D,0xC2,0x5E, 0x5A,0x1F,0x32,0x35,0x9C,0xA8,0x73,0x30,
352
+
353
+    0x29,0x3D,0xE7,0x92,0x87,0x1B,0x2B,0x4B, 0xA5,0x57,0x97,0x40,0x15,0xE6,0xBC,0x0E,
354
+    0xEB,0xC3,0x34,0x2D,0xB8,0x44,0x25,0xA4, 0x1C,0xC7,0x23,0xED,0x90,0x6E,0x50,0x00,
355
+    0x99,0x9E,0x4D,0xD9,0xDA,0x8D,0x6F,0x5F, 0x3E,0xD7,0x21,0x74,0x86,0xDF,0x6B,0x05,
356
+    0x8E,0x5D,0x37,0x11,0xD2,0x28,0x75,0xD6, 0xA7,0x77,0x24,0xBF,0xF0,0xB0,0x02,0xB7,
357
+    0xF8,0xFC,0x81,0x09,0xB1,0x01,0x76,0x91, 0x7D,0x0F,0xC8,0xA0,0xF2,0xCB,0x78,0x60,
358
+    0xD1,0xF7,0xE0,0xB5,0x98,0x22,0xB3,0x20, 0x1D,0xA6,0xDB,0x7B,0x59,0x9F,0xAE,0x31,
359
+    0xFB,0xD3,0xB6,0xCA,0x43,0x72,0x07,0xF4, 0xD8,0x41,0x14,0x55,0x0D,0x54,0x8B,0xB9,
360
+    0xAD,0x46,0x0B,0xAF,0x80,0x52,0x2C,0xFA, 0x8C,0x89,0x66,0xFD,0xB2,0xA9,0x9B,0xC0,
361
+  };
362
+  MEMALIGN unsigned char r[GROUP_PARALLELISM*(8+56)];  /* 56 because we will move back in memory while looping */
363
+  MEMALIGN unsigned char sbox_in[GROUP_PARALLELISM],sbox_out[GROUP_PARALLELISM],perm_out[GROUP_PARALLELISM];
364
+  int roff;
365
+  int i,g,count_all=GROUP_PARALLELISM;
366
+
367
+  roff=GROUP_PARALLELISM*56;
368
+
369
+#define FASTTRASP1
370
+#ifndef FASTTRASP1
371
+  for(g=0;g<count;g++){
372
+    // Init registers 
373
+    int j;
374
+    for(j=0;j<8;j++){
375
+      r[roff+GROUP_PARALLELISM*j+g]=ib[8*g+j];
376
+    }
377
+  }
378
+#else
379
+  trasp_N_8((unsigned char *)&r[roff],(unsigned char *)ib,count);
380
+#endif
381
+//dump_mem("OLD r[roff]",&r[roff],GROUP_PARALLELISM*8,GROUP_PARALLELISM);
382
+
383
+  // loop over kk[55]..kk[0]
384
+  for(i=55;i>=0;i--){
385
+    {
386
+      MEMALIGN batch tkkmulti=kkmulti[i];
387
+      batch *si=(batch *)sbox_in;
388
+      batch *r6_N=(batch *)(r+roff+GROUP_PARALLELISM*6);
389
+      for(g=0;g<count_all/BYTES_PER_BATCH;g++){
390
+        si[g]=B_FFXOR(tkkmulti,r6_N[g]);              //FIXME: introduce FASTBATCH?
391
+      }
392
+    }
393
+
394
+    // table lookup, this works on only one byte at a time
395
+    // most difficult part of all
396
+    // - can't be parallelized
397
+    // - can't be synthesized through boolean terms (8 input bits are too many)
398
+    for(g=0;g<count_all;g++){
399
+      sbox_out[g]=block_sbox[sbox_in[g]];
400
+    }
401
+
402
+    // bit permutation
403
+    {
404
+      unsigned char *po=(unsigned char *)perm_out;
405
+      unsigned char *so=(unsigned char *)sbox_out;
406
+//dump_mem("pre perm ",(unsigned char *)so,GROUP_PARALLELISM,GROUP_PARALLELISM);
407
+      for(g=0;g<count_all;g+=BYTES_PER_BATCH){
408
+        MEMALIGN batch in,out;
409
+        in=*(batch *)&so[g];
410
+
411
+        out=B_FFOR(
412
+	    B_FFOR(
413
+	    B_FFOR(
414
+	    B_FFOR(
415
+	    B_FFOR(
416
+	           B_FFSH8L(B_FFAND(in,B_FFN_ALL_29()),1),
417
+	           B_FFSH8L(B_FFAND(in,B_FFN_ALL_02()),6)),
418
+	           B_FFSH8L(B_FFAND(in,B_FFN_ALL_04()),3)),
419
+	           B_FFSH8R(B_FFAND(in,B_FFN_ALL_10()),2)),
420
+	           B_FFSH8R(B_FFAND(in,B_FFN_ALL_40()),6)),
421
+	           B_FFSH8R(B_FFAND(in,B_FFN_ALL_80()),4));
422
+
423
+        *(batch *)&po[g]=out;
424
+      }
425
+//dump_mem("post perm",(unsigned char *)po,GROUP_PARALLELISM,GROUP_PARALLELISM);
426
+    }
427
+
428
+    roff-=GROUP_PARALLELISM; /* virtual shift of registers */
429
+
430
+#if 0
431
+/* one by one */
432
+    for(g=0;g<count_all;g++){
433
+      r[roff+GROUP_PARALLELISM*0+g]=r[roff+GROUP_PARALLELISM*8+g]^sbox_out[g];
434
+      r[roff+GROUP_PARALLELISM*6+g]^=perm_out[g];
435
+      r[roff+GROUP_PARALLELISM*4+g]^=r[roff+GROUP_PARALLELISM*0+g];
436
+      r[roff+GROUP_PARALLELISM*3+g]^=r[roff+GROUP_PARALLELISM*0+g];
437
+      r[roff+GROUP_PARALLELISM*2+g]^=r[roff+GROUP_PARALLELISM*0+g];
438
+    }
439
+#else
440
+    for(g=0;g<count_all;g+=BEST_SPAN){
441
+      XOR_BEST_BY(&r[roff+GROUP_PARALLELISM*0+g],&r[roff+GROUP_PARALLELISM*8+g],&sbox_out[g]);
442
+      XOREQ_BEST_BY(&r[roff+GROUP_PARALLELISM*6+g],&perm_out[g]);
443
+      XOREQ_BEST_BY(&r[roff+GROUP_PARALLELISM*4+g],&r[roff+GROUP_PARALLELISM*0+g]);
444
+      XOREQ_BEST_BY(&r[roff+GROUP_PARALLELISM*3+g],&r[roff+GROUP_PARALLELISM*0+g]);
445
+      XOREQ_BEST_BY(&r[roff+GROUP_PARALLELISM*2+g],&r[roff+GROUP_PARALLELISM*0+g]);
446
+    }
447
+#endif
448
+  }
449
+
450
+#define FASTTRASP2
451
+#ifndef FASTTRASP2
452
+  for(g=0;g<count;g++){
453
+    // Copy results
454
+    int j;
455
+    for(j=0;j<8;j++){
456
+      bd[8*g+j]=r[roff+GROUP_PARALLELISM*j+g];
457
+    }
458
+  }
459
+#else
460
+  trasp_8_N((unsigned char *)&r[roff],(unsigned char *)bd,count);
461
+#endif
462
+}
463
+
464
+//-----------------------------------EXTERNAL INTERFACE
465
+
466
+//-----get internal parallelism
467
+
468
+int get_internal_parallelism(void){
469
+  return GROUP_PARALLELISM;
470
+}
471
+
472
+//-----get suggested cluster size
473
+
474
+int get_suggested_cluster_size(void){
475
+  int r;
476
+  r=GROUP_PARALLELISM+GROUP_PARALLELISM/10;
477
+  if(r<GROUP_PARALLELISM+5) r=GROUP_PARALLELISM+5;
478
+  return r;
479
+}
480
+
481
+//-----key structure
482
+
483
+void *get_key_struct(void){
484
+  struct csa_keys_t *keys=(struct csa_keys_t *)MALLOC(sizeof(struct csa_keys_t));
485
+  if(keys) {
486
+    static const unsigned char pk[8] = { 0,0,0,0,0,0,0,0 };
487
+    set_control_words(keys,pk,pk);
488
+    }
489
+  return keys;
490
+}
491
+
492
+void free_key_struct(void *keys){
493
+  FREE(keys);
494
+}
495
+
496
+//-----set control words
497
+
498
+static void schedule_key(struct csa_key_t *key, const unsigned char *pk){
499
+  // could be made faster, but is not run often
500
+  int bi,by;
501
+  int i,j;
502
+// key
503
+  memcpy(key->ck,pk,8);
504
+// precalculations for stream
505
+  key_schedule_stream(key->ck,key->iA,key->iB);
506
+  for(by=0;by<8;by++){
507
+    for(bi=0;bi<8;bi++){
508
+      key->ck_g[by][bi]=(key->ck[by]&(1<<bi))?FF1():FF0();
509
+    }
510
+  }
511
+  for(by=0;by<8;by++){
512
+    for(bi=0;bi<4;bi++){
513
+      key->iA_g[by][bi]=(key->iA[by]&(1<<bi))?FF1():FF0();
514
+      key->iB_g[by][bi]=(key->iB[by]&(1<<bi))?FF1():FF0();
515
+    }
516
+  }
517
+// precalculations for block
518
+  key_schedule_block(key->ck,key->kk);
519
+  for(i=0;i<56;i++){
520
+    for(j=0;j<BYTES_PER_BATCH;j++){
521
+      *(((unsigned char *)&key->kkmulti[i])+j)=key->kk[i];
522
+    }
523
+  }
524
+}
525
+
526
+void set_control_words(void *keys, const unsigned char *ev, const unsigned char *od){
527
+  schedule_key(&((struct csa_keys_t *)keys)->even,ev);
528
+  schedule_key(&((struct csa_keys_t *)keys)->odd,od);
529
+}
530
+
531
+void set_even_control_word(void *keys, const unsigned char *pk){
532
+  schedule_key(&((struct csa_keys_t *)keys)->even,pk);
533
+}
534
+
535
+void set_odd_control_word(void *keys, const unsigned char *pk){
536
+  schedule_key(&((struct csa_keys_t *)keys)->odd,pk);
537
+}
538
+
539
+//-----get control words
540
+
541
+void get_control_words(void *keys, unsigned char *even, unsigned char *odd){
542
+  memcpy(even,&((struct csa_keys_t *)keys)->even.ck,8);
543
+  memcpy(odd,&((struct csa_keys_t *)keys)->odd.ck,8);
544
+}
545
+
546
+//----- decrypt
547
+
548
+int decrypt_packets(void *keys, unsigned char **cluster){
549
+  // statistics, currently unused
550
+  int stat_no_scramble=0;
551
+  int stat_reserved=0;
552
+  int stat_decrypted[2]={0,0};
553
+  int stat_decrypted_mini=0;
554
+  unsigned char **clst;
555
+  unsigned char **clst2;
556
+  int grouped;
557
+  int group_ev_od;
558
+  int advanced;
559
+  int can_advance;
560
+  unsigned char *g_pkt[GROUP_PARALLELISM];
561
+  int g_len[GROUP_PARALLELISM];
562
+  int g_offset[GROUP_PARALLELISM];
563
+  int g_n[GROUP_PARALLELISM];
564
+  int g_residue[GROUP_PARALLELISM];
565
+  unsigned char *pkt;
566
+  int xc0,ev_od,len,offset,n,residue;
567
+  struct csa_key_t* k;
568
+  int i,j,iter,g;
569
+  int t23,tsmall;
570
+  int alive[24];
571
+//icc craziness  int pad1=0; //////////align! FIXME
572
+  unsigned char *encp[GROUP_PARALLELISM];
573
+  MEMALIGN unsigned char stream_in[GROUP_PARALLELISM*8];
574
+  MEMALIGN unsigned char stream_out[GROUP_PARALLELISM*8];
575
+  MEMALIGN unsigned char ib[GROUP_PARALLELISM*8];
576
+  MEMALIGN unsigned char block_out[GROUP_PARALLELISM*8];
577
+  struct stream_regs regs;
578
+
579
+//icc craziness  i=(int)&pad1;//////////align!!! FIXME
580
+
581
+  // build a list of packets to be processed
582
+  clst=cluster;
583
+  grouped=0;
584
+  advanced=0;
585
+  can_advance=1;
586
+  group_ev_od=-1; // silence incorrect compiler warning
587
+  pkt=*clst;
588
+  do{ // find a new packet
589
+    if(grouped==GROUP_PARALLELISM){
590
+      // full
591
+      break;
592
+    }
593
+    if(pkt==NULL){
594
+      // no more ranges
595
+      break;
596
+    }
597
+    if(pkt>=*(clst+1)){
598
+      // out of this range, try next
599
+      clst++;clst++;
600
+      pkt=*clst;
601
+      continue;
602
+    }
603
+
604
+    do{ // handle this packet
605
+      xc0=pkt[3]&0xc0;
606
+      DBG(fprintf(stderr,"   exam pkt=%p, xc0=%02x, can_adv=%i\n",pkt,xc0,can_advance));
607
+      if(xc0==0x00){
608
+        DBG(fprintf(stderr,"skip clear pkt %p (can_advance is %i)\n",pkt,can_advance));
609
+        advanced+=can_advance;
610
+        stat_no_scramble++;
611
+        break;
612
+      }
613
+      if(xc0==0x40){
614
+        DBG(fprintf(stderr,"skip reserved pkt %p (can_advance is %i)\n",pkt,can_advance));
615
+        advanced+=can_advance;
616
+        stat_reserved++;
617
+        break;
618
+      }
619
+      if(xc0==0x80||xc0==0xc0){ // encrypted
620
+        ev_od=(xc0&0x40)>>6; // 0 even, 1 odd
621
+        if(grouped==0) group_ev_od=ev_od; // this group will be all even (or odd)
622
+        if(group_ev_od==ev_od){ // could be added to group
623
+          pkt[3]&=0x3f;  // consider it decrypted now
624
+          if(pkt[3]&0x20){ // incomplete packet
625
+            offset=4+pkt[4]+1;
626
+            len=188-offset;
627
+            n=len>>3;
628
+            residue=len-(n<<3);
629
+            if(n==0){ // decrypted==encrypted!
630
+              DBG(fprintf(stderr,"DECRYPTED MINI! (can_advance is %i)\n",can_advance));
631
+              advanced+=can_advance;
632
+              stat_decrypted_mini++;
633
+              break; // this doesn't need more processing
634
+            }
635
+          }else{
636
+            len=184;
637
+            offset=4;
638
+            n=23;
639
+            residue=0;
640
+          }
641
+          g_pkt[grouped]=pkt;
642
+          g_len[grouped]=len;
643
+          g_offset[grouped]=offset;
644
+          g_n[grouped]=n;
645
+          g_residue[grouped]=residue;
646
+          DBG(fprintf(stderr,"%2i: eo=%i pkt=%p len=%03i n=%2i residue=%i\n",grouped,ev_od,pkt,len,n,residue));
647
+          grouped++;
648
+          advanced+=can_advance;
649
+          stat_decrypted[ev_od]++;
650
+        }
651
+        else{
652
+          can_advance=0;
653
+          DBG(fprintf(stderr,"skip pkt %p and can_advance set to 0\n",pkt));
654
+          break; // skip and go on
655
+        }
656
+      }
657
+    } while(0);
658
+
659
+    if(can_advance){
660
+      // move range start forward
661
+      *clst+=188;
662
+    }
663
+    // next packet, if there is one
664
+    pkt+=188;
665
+  } while(1);
666
+  DBG(fprintf(stderr,"-- result: grouped %i pkts, advanced %i pkts\n",grouped,advanced));
667
+
668
+  // delete empty ranges and compact list
669
+  clst2=cluster;
670
+  for(clst=cluster;*clst!=NULL;clst+=2){
671
+    // if not empty
672
+    if(*clst<*(clst+1)){
673
+      // it will remain 
674
+      *clst2=*clst;
675
+      *(clst2+1)=*(clst+1);
676
+      clst2+=2;
677
+    }
678
+  }
679
+  *clst2=NULL;
680
+
681
+  if(grouped==0){
682
+    // no processing needed
683
+    return advanced;
684
+  }
685
+
686
+  //  sort them, longest payload first
687
+  //  we expect many n=23 packets and a few n<23
688
+  DBG(fprintf(stderr,"PRESORTING\n"));
689
+  for(i=0;i<grouped;i++){
690
+    DBG(fprintf(stderr,"%2i of %2i: pkt=%p len=%03i n=%2i residue=%i\n",i,grouped,g_pkt[i],g_len[i],g_n[i],g_residue[i]));
691
+    }
692
+  // grouped is always <= GROUP_PARALLELISM
693
+
694
+#define g_swap(a,b) \
695
+    pkt=g_pkt[a]; \
696
+    g_pkt[a]=g_pkt[b]; \
697
+    g_pkt[b]=pkt; \
698
+\
699
+    len=g_len[a]; \
700
+    g_len[a]=g_len[b]; \
701
+    g_len[b]=len; \
702
+\
703
+    offset=g_offset[a]; \
704
+    g_offset[a]=g_offset[b]; \
705
+    g_offset[b]=offset; \
706
+\
707
+    n=g_n[a]; \
708
+    g_n[a]=g_n[b]; \
709
+    g_n[b]=n; \
710
+\
711
+    residue=g_residue[a]; \
712
+    g_residue[a]=g_residue[b]; \
713
+    g_residue[b]=residue;
714
+
715
+  // step 1: move n=23 packets before small packets
716
+  t23=0;
717
+  tsmall=grouped-1;
718
+  for(;;){
719
+    for(;t23<grouped;t23++){
720
+      if(g_n[t23]!=23) break;
721
+    }
722
+DBG(fprintf(stderr,"t23 after for =%i\n",t23));
723
+    
724
+    for(;tsmall>=0;tsmall--){
725
+      if(g_n[tsmall]==23) break;
726
+    }
727
+DBG(fprintf(stderr,"tsmall after for =%i\n",tsmall));
728
+    
729
+    if(tsmall-t23<1) break;
730
+    
731
+DBG(fprintf(stderr,"swap t23=%i,tsmall=%i\n",t23,tsmall));
732
+
733
+    g_swap(t23,tsmall);
734
+
735
+    t23++;
736
+    tsmall--;
737
+DBG(fprintf(stderr,"new t23=%i,tsmall=%i\n\n",t23,tsmall));
738
+  }
739
+  DBG(fprintf(stderr,"packets with n=23, t23=%i   grouped=%i\n",t23,grouped));
740
+  DBG(fprintf(stderr,"MIDSORTING\n"));
741
+  for(i=0;i<grouped;i++){
742
+    DBG(fprintf(stderr,"%2i of %2i: pkt=%p len=%03i n=%2i residue=%i\n",i,grouped,g_pkt[i],g_len[i],g_n[i],g_residue[i]));
743
+    }
744
+
745
+  // step 2: sort small packets in decreasing order of n (bubble sort is enough)
746
+  for(i=t23;i<grouped;i++){
747
+    for(j=i+1;j<grouped;j++){
748
+      if(g_n[j]>g_n[i]){
749
+        g_swap(i,j);
750
+      }
751
+    }
752
+  }
753
+  DBG(fprintf(stderr,"POSTSORTING\n"));
754
+  for(i=0;i<grouped;i++){
755
+    DBG(fprintf(stderr,"%2i of %2i: pkt=%p len=%03i n=%2i residue=%i\n",i,grouped,g_pkt[i],g_len[i],g_n[i],g_residue[i]));
756
+    }
757
+
758
+  // we need to know how many packets need 23 iterations, how many 22...
759
+  for(i=0;i<=23;i++){
760
+    alive[i]=0;
761
+  }
762
+  // count
763
+  alive[23-1]=t23;
764
+  for(i=t23;i<grouped;i++){
765
+    alive[g_n[i]-1]++;
766
+  }
767
+  // integrate
768
+  for(i=22;i>=0;i--){
769
+    alive[i]+=alive[i+1];
770
+  }
771
+  DBG(fprintf(stderr,"ALIVE\n"));
772
+  for(i=0;i<=23;i++){
773
+    DBG(fprintf(stderr,"alive%2i=%i\n",i,alive[i]));
774
+    }
775
+
776
+  // choose key
777
+  if(group_ev_od==0){
778
+    k=&((struct csa_keys_t *)keys)->even;
779
+  }
780
+  else{
781
+    k=&((struct csa_keys_t *)keys)->odd;
782
+  }
783
+
784
+  //INIT
785
+//#define INITIALIZE_UNUSED_INPUT
786
+#ifdef INITIALIZE_UNUSED_INPUT
787
+// unnecessary zeroing.
788
+// without this, we operate on uninitialized memory
789
+// when grouped<GROUP_PARALLELISM, but it's not a problem,
790
+// as final results will be discarded.
791
+// random data makes debugging sessions difficult.
792
+  for(j=0;j<GROUP_PARALLELISM*8;j++) stream_in[j]=0;
793
+DBG(fprintf(stderr,"--- WARNING: you could gain speed by not initializing unused memory ---\n"));
794
+#else
795
+DBG(fprintf(stderr,"--- WARNING: DEBUGGING IS MORE DIFFICULT WHEN PROCESSING RANDOM DATA CHANGING AT EVERY RUN! ---\n"));
796
+#endif
797
+
798
+  for(g=0;g<grouped;g++){
799
+    encp[g]=g_pkt[g];
800
+    DBG(fprintf(stderr,"header[%i]=%p (%02x)\n",g,encp[g],*(encp[g])));
801
+    encp[g]+=g_offset[g]; // skip header
802
+    FFTABLEIN(stream_in,g,encp[g]);
803
+  }
804
+//dump_mem("stream_in",stream_in,GROUP_PARALLELISM*8,BYPG);
805
+
806
+
807
+  // ITER 0
808
+DBG(fprintf(stderr,">>>>>ITER 0\n"));
809
+  iter=0;
810
+  stream_cypher_group_init(&regs,k->iA_g,k->iB_g,stream_in);
811
+  // fill first ib
812
+  for(g=0;g<alive[iter];g++){
813
+    COPY_8_BY(ib+8*g,encp[g]);
814
+  }
815
+DBG(dump_mem("IB ",ib,8*alive[iter],8));
816
+  // ITER 1..N-1
817
+  for (iter=1;iter<23&&alive[iter-1]>0;iter++){
818
+DBG(fprintf(stderr,">>>>>ITER %i\n",iter));
819
+    // alive and just dead packets: calc block
820
+    block_decypher_group(k->kkmulti,ib,block_out,alive[iter-1]);
821
+DBG(dump_mem("BLO_ib ",block_out,8*alive[iter-1],8));
822
+    // all packets (dead too): calc stream
823
+    stream_cypher_group_normal(&regs,stream_out);
824
+//dump_mem("stream_out",stream_out,GROUP_PARALLELISM*8,BYPG);
825
+
826
+    // alive packets: calc ib
827
+    for(g=0;g<alive[iter];g++){
828
+      FFTABLEOUT(ib+8*g,stream_out,g);
829
+DBG(dump_mem("stream_out_ib ",ib+8*g,8,8));
830
+// XOREQ8BY gcc bug? 2x4 ok, 8 ko    UPDATE: result ok but speed 1-2% slower (!!!???)
831
+#if 1
832
+      XOREQ_4_BY(ib+8*g,encp[g]+8);
833
+      XOREQ_4_BY(ib+8*g+4,encp[g]+8+4);
834
+#else
835
+      XOREQ_8_BY(ib+8*g,encp[g]+8);
836
+#endif
837
+DBG(dump_mem("after_stream_xor_ib ",ib+8*g,8,8));
838
+    }
839
+    // alive packets: decrypt data
840
+    for(g=0;g<alive[iter];g++){
841
+DBG(dump_mem("before_ib_decrypt_data ",encp[g],8,8));
842
+      XOR_8_BY(encp[g],ib+8*g,block_out+8*g);
843
+DBG(dump_mem("after_ib_decrypt_data ",encp[g],8,8));
844
+    }
845
+    // just dead packets: write decrypted data
846
+    for(g=alive[iter];g<alive[iter-1];g++){
847
+DBG(dump_mem("jd_before_ib_decrypt_data ",encp[g],8,8));
848
+      COPY_8_BY(encp[g],block_out+8*g);
849
+DBG(dump_mem("jd_after_ib_decrypt_data ",encp[g],8,8));
850
+    }
851
+    // just dead packets: decrypt residue
852
+    for(g=alive[iter];g<alive[iter-1];g++){
853
+DBG(dump_mem("jd_before_decrypt_residue ",encp[g]+8,g_residue[g],g_residue[g]));
854
+      FFTABLEOUTXORNBY(g_residue[g],encp[g]+8,stream_out,g);
855
+DBG(dump_mem("jd_after_decrypt_residue ",encp[g]+8,g_residue[g],g_residue[g]));
856
+    }
857
+    // alive packets: pointers++
858
+    for(g=0;g<alive[iter];g++) encp[g]+=8;
859
+  }
860
+  // ITER N
861
+DBG(fprintf(stderr,">>>>>ITER 23\n"));
862
+  iter=23;
863
+  // calc block
864
+  block_decypher_group(k->kkmulti,ib,block_out,alive[iter-1]);
865
+DBG(dump_mem("23BLO_ib ",block_out,8*alive[iter-1],8));
866
+  // just dead packets: write decrypted data
867
+  for(g=alive[iter];g<alive[iter-1];g++){
868
+DBG(dump_mem("23jd_before_ib_decrypt_data ",encp[g],8,8));
869
+    COPY_8_BY(encp[g],block_out+8*g);
870
+DBG(dump_mem("23jd_after_ib_decrypt_data ",encp[g],8,8));
871
+  }
872
+  // no residue possible
873
+  // so do nothing
874
+
875
+  DBG(fprintf(stderr,"returning advanced=%i\n",advanced));
876
+
877
+  M_EMPTY(); // restore CPU multimedia state
878
+
879
+  return advanced;
880
+}
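The packet bookkeeping in decrypt_packets above can be sketched in isolation. This is a minimal illustration under stated assumptions, not part of FFdecsa: the names `payload_geom`, `classify_payload`, and `count_alive` are hypothetical and introduced only here. It reproduces the offset/len/n/residue arithmetic and the count-then-integrate construction of alive[], after which alive[i] is the number of grouped packets that have more than i 8-byte blocks. The sketch assumes a well-formed adaptation field length; like the original, it does not validate pkt[4].

```c
#include <assert.h>
#include <stddef.h>

#define TS_SIZE 188
#define MAX_BLOCKS 23 /* 184 payload bytes / 8 bytes per block */

/* Hypothetical record; the original keeps these in parallel g_* arrays. */
struct payload_geom {
  int scrambled; /* 0: clear or reserved, 1: scrambled */
  int odd;       /* 0: even key, 1: odd key */
  int offset;    /* first payload byte */
  int len;       /* payload length in bytes */
  int n;         /* number of whole 8-byte blocks */
  int residue;   /* trailing bytes, handled by the stream cipher only */
};

static struct payload_geom classify_payload(const unsigned char *pkt) {
  struct payload_geom g = {0, 0, 0, 0, 0, 0};
  int xc0 = pkt[3] & 0xc0;                  /* scrambling control bits */
  if (xc0 == 0x00 || xc0 == 0x40) return g; /* clear or reserved: skip */
  g.scrambled = 1;
  g.odd = (xc0 & 0x40) >> 6;
  if (pkt[3] & 0x20) {
    g.offset = 4 + pkt[4] + 1; /* header + adaptation field + its length byte */
  } else {
    g.offset = 4;              /* header only */
  }
  g.len = TS_SIZE - g.offset;
  g.n = g.len >> 3;
  g.residue = g.len - (g.n << 3);
  return g;
}

/* alive[i] becomes the number of packets with n > i (so alive[23] is 0). */
static void count_alive(const int *n, int grouped, int alive[MAX_BLOCKS + 1]) {
  int i;
  for (i = 0; i <= MAX_BLOCKS; i++) alive[i] = 0;
  for (i = 0; i < grouped; i++) {
    assert(n[i] >= 1 && n[i] <= MAX_BLOCKS); /* n==0 packets are never grouped */
    alive[n[i] - 1]++;                       /* count */
  }
  for (i = MAX_BLOCKS - 1; i >= 0; i--) {
    alive[i] += alive[i + 1];                /* integrate (suffix sum) */
  }
}
```

For a group with block counts 23, 21 and 1, this gives alive[0] = 3, alive[1] through alive[20] = 2, alive[21] = alive[22] = 1 and alive[23] = 0, which is why the main loop can stop as soon as alive[iter-1] reaches zero.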

+ 62
- 0
FFdecsa/FFdecsa.h

@@ -0,0 +1,62 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+#ifndef FFDECSA_H
22
+#define FFDECSA_H
23
+
24
+//----- public interface
25
+
26
+// -- how many packets can be decrypted at the same time
27
+// This reports the internal decryption parallelism.
28
+// You should try to call decrypt_packets with more packets than the number
29
+// returned here for performance reasons (use get_suggested_cluster_size to know
30
+// how many).
31
+int get_internal_parallelism(void);
32
+
33
+// -- how many packets you should have in a cluster when calling decrypt_packets
34
+// This is a suggestion to achieve optimal performance; typically a little
35
+// higher than what get_internal_parallelism returns.
36
+// Passing fewer packets could slow down the decryption.
37
+// Passing more packets is never bad (if you don't spend a lot of time building
38
+// the list).
39
+int get_suggested_cluster_size(void);
40
+
41
+// -- alloc & free the key structure
42
+void *get_key_struct(void);
43
+void free_key_struct(void *keys);
44
+
45
+// -- set control words, 8 bytes each
46
+void set_control_words(void *keys, const unsigned char *even, const unsigned char *odd);
47
+
48
+// -- set even control word, 8 bytes
49
+void set_even_control_word(void *keys, const unsigned char *even);
50
+
51
+// -- set odd control word, 8 bytes
52
+void set_odd_control_word(void *keys, const unsigned char *odd);
53
+
54
+// -- get control words, 8 bytes each
55
+//void get_control_words(void *keys, unsigned char *even, unsigned char *odd);
56
+
57
+// -- decrypt many TS packets
58
+// This interface is a bit complicated because it is designed for maximum speed.
59
+// Please read doc/how_to_use.txt.
60
+int decrypt_packets(void *keys, unsigned char **cluster);
61
+
62
+#endif
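The header defers the cluster format to doc/how_to_use.txt, which is not shown here. Judging only from FFdecsa.c and FFdecsa_test.c in this commit, a cluster appears to be a NULL-terminated array of (start, end) pointer pairs, each delimiting a run of 188-byte packets. The sketch below illustrates that layout and the in-place compaction that decrypt_packets performs on it. The helper names `cluster_pending` and `cluster_compact` are hypothetical and are not part of the FFdecsa interface.

```c
#include <assert.h>
#include <stddef.h>

#define TS_SIZE 188

/* Count the packets still pending in a NULL-terminated list of
 * [start, end) pointer pairs. */
static int cluster_pending(unsigned char **cluster) {
  int pkts = 0;
  unsigned char **c;
  for (c = cluster; *c != NULL; c += 2) {
    assert(c[1] >= c[0]);
    pkts += (int)((c[1] - c[0]) / TS_SIZE);
  }
  return pkts;
}

/* Drop empty ranges in place, preserving order.
 * Returns the number of ranges that remain. */
static int cluster_compact(unsigned char **cluster) {
  unsigned char **src;
  unsigned char **dst = cluster;
  for (src = cluster; *src != NULL; src += 2) {
    if (src[0] < src[1]) { /* not empty: keep it */
      dst[0] = src[0];
      dst[1] = src[1];
      dst += 2;
    }
  }
  *dst = NULL; /* re-terminate the shortened list */
  return (int)((dst - cluster) / 2);
}
```

Since decrypt_packets advances each range start as packets are consumed, a caller that keeps the same array across calls can loop until the first entry is NULL, or rebuild the list each time as the test program does.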

+ 176
- 0
FFdecsa/FFdecsa_test.c

@@ -0,0 +1,176 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+#include <string.h>
22
+#include <stdio.h>
23
+#include <sys/time.h>
24
+
25
+#include "FFdecsa.h"
26
+
27
+#ifndef NULL
28
+#define NULL 0
29
+#endif
30
+
31
+#include "FFdecsa_test_testcases.h"
32
+
33
+int compare(unsigned char *p1, unsigned char *p2, int n, int silently){
34
+  int i;
35
+  int ok=1;
36
+  for(i=0;i<n;i++){
37
+    if(i==3) continue; // tolerate this: byte 3 holds the scrambling control bits, which decrypt_packets clears
38
+    if(p1[i]!=p2[i]){
39
+//      fprintf(stderr,"at pos 0x%02x, got 0x%02x instead of 0x%02x\n",i,p1[i],p2[i]);
40
+      ok=0;
41
+    }
42
+  }
43
+  if(!silently){
44
+    if(ok){
45
+       fprintf(stderr,"CORRECT!\n");
46
+    }
47
+    else{
48
+       fprintf(stderr,"FAILED!\n");
49
+    }
50
+  }
51
+  return ok;
52
+}
53
+
54
+
55
+//MAIN
56
+
57
+#define TS_PKTS_FOR_TEST (30*1000)
58
+//#define TS_PKTS_FOR_TEST 1000*1000
59
+unsigned char megabuf[188*TS_PKTS_FOR_TEST];
60
+unsigned char onebuf[188];
61
+
62
+unsigned char *cluster[10];
63
+
64
+int main(void){
65
+  int i;
66
+  struct timeval tvs,tve;
67
+  void *keys=get_key_struct();
68
+  int ok=1;
69
+
70
+  fprintf(stderr,"FFdecsa 1.0: testing correctness and speed\n");
71
+
72
+/* begin correctness testing */
73
+
74
+  set_control_words(keys,test_invalid_key,test_1_key);
75
+  memcpy(onebuf,test_1_encrypted,188);
76
+  cluster[0]=onebuf;cluster[1]=onebuf+188;cluster[2]=NULL;
77
+  decrypt_packets(keys,cluster);
78
+  ok*=compare(onebuf,test_1_expected,188,0);
79
+
80
+  set_control_words(keys,test_2_key,test_invalid_key);
81
+  memcpy(onebuf,test_2_encrypted,188);
82
+  cluster[0]=onebuf;cluster[1]=onebuf+188;cluster[2]=NULL;
83
+  decrypt_packets(keys,cluster);
84
+  ok*=compare(onebuf,test_2_expected,188,0);
85
+
86
+  set_control_words(keys,test_3_key,test_invalid_key);
87
+  memcpy(onebuf,test_3_encrypted,188);
88
+  cluster[0]=onebuf;cluster[1]=onebuf+188;cluster[2]=NULL;
89
+  decrypt_packets(keys,cluster);
90
+  ok*=compare(onebuf,test_3_expected,188,0);
91
+
92
+  set_control_words(keys,test_p_10_0_key,test_invalid_key);
93
+  memcpy(onebuf,test_p_10_0_encrypted,188);
94
+  cluster[0]=onebuf;cluster[1]=onebuf+188;cluster[2]=NULL;
95
+  decrypt_packets(keys,cluster);
96
+  ok*=compare(onebuf,test_p_10_0_expected,188,0);
97
+
98
+  set_control_words(keys,test_p_1_6_key,test_invalid_key);
99
+  memcpy(onebuf,test_p_1_6_encrypted,188);
100
+  cluster[0]=onebuf;cluster[1]=onebuf+188;cluster[2]=NULL;
101
+  decrypt_packets(keys,cluster);
102
+  ok*=compare(onebuf,test_p_1_6_expected,188,0);
103
+
104
+/* begin speed testing */
105
+
106
+#if 0
107
+// test on short packets
108
+#define s_encrypted test_p_1_6_encrypted
109
+#define s_key_e     test_p_1_6_key
110
+#define s_key_o     test_invalid_key
111
+#define s_expected  test_p_1_6_expected
112
+
113
+#else
114
+//test on full packets
115
+#define s_encrypted test_2_encrypted
116
+#define s_key_e     test_2_key
117
+#define s_key_o     test_invalid_key
118
+#define s_expected  test_2_expected
119
+
120
+#endif
121
+
122
+  for(i=0;i<TS_PKTS_FOR_TEST;i++){
123
+    memcpy(&megabuf[188*i],s_encrypted,188);
124
+  }
125
+// test that packets are not shuffled around
126
+// so, let's put an undecryptable packet somewhere in the middle (we will use a wrong key)
127
+#define noONE_POISONED_PACKET
128
+#ifdef ONE_POISONED_PACKET
129
+  memcpy(&megabuf[188*(TS_PKTS_FOR_TEST*2/3)],test_3_encrypted,188);
130
+#endif
131
+
132
+  // start decryption
133
+  set_control_words(keys,s_key_e,s_key_o);
134
+  gettimeofday(&tvs,NULL);
135
+#if 0
136
+// force one by one
137
+  for(i=0;i<TS_PKTS_FOR_TEST;i++){
138
+    cluster[0]=megabuf+188*i;cluster[1]=megabuf+188*i+188;cluster[2]=NULL;
139
+    decrypt_packets(keys,cluster);
140
+  }
141
+#else
142
+  {
143
+    int done=0;
144
+    while(done<TS_PKTS_FOR_TEST){
145
+      //fprintf(stderr,"done=%i\n",done);
146
+      cluster[0]=megabuf+188*done;cluster[1]=megabuf+188*TS_PKTS_FOR_TEST;cluster[2]=NULL;
147
+      done+=decrypt_packets(keys,cluster);
148
+    }
149
+  }
150
+#endif
151
+  gettimeofday(&tve,NULL);
152
+  //end decryption
153
+
154
+  fprintf(stderr,"speed=%f Mbit/s\n",(184*TS_PKTS_FOR_TEST*8)/((tve.tv_sec-tvs.tv_sec)+1e-6*(tve.tv_usec-tvs.tv_usec))/1000000);
155
+  fprintf(stderr,"speed=%f pkts/s\n",TS_PKTS_FOR_TEST/((tve.tv_sec-tvs.tv_sec)+1e-6*(tve.tv_usec-tvs.tv_usec)));
156
+
157
+  // this packet couldn't be decrypted correctly
158
+#ifdef ONE_POISONED_PACKET
159
+  compare(megabuf+188*(TS_PKTS_FOR_TEST*2/3),test_3_expected,188,0); /* will fail because we used a wrong key */
160
+#endif
161
+  // these should be ok
162
+  ok*=compare(megabuf,s_expected,188,0);
163
+  ok*=compare(megabuf+188*511,s_expected,188,0);
164
+  ok*=compare(megabuf+188*512,s_expected,188,0);
165
+  ok*=compare(megabuf+188*319,s_expected,188,0);
166
+  ok*=compare(megabuf+188*(TS_PKTS_FOR_TEST-1),s_expected,188,0);
167
+
168
+  for(i=0;i<TS_PKTS_FOR_TEST;i++){
169
+    if(!compare(megabuf+188*i,s_expected,188,1)){
170
+      fprintf(stderr,"FAILED COMPARISON OF PACKET %10i\n",i);
171
+      ok=0;
172
+    };
173
+  }
174
+
175
+  return ok ? 0 : 10;
176
+}
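The two `fprintf` lines in the speed test above compute throughput from a pair of `gettimeofday()` samples, counting only the 184 payload bytes of each 188-byte packet. A minimal sketch of that arithmetic follows; the helper names `elapsed_seconds`, `payload_mbit_per_s` and `packets_per_s` are made up for illustration and are not part of FFdecsa.

```c
#include <assert.h>
#include <sys/time.h>

/* Seconds between two gettimeofday() samples (hypothetical helper). */
static double elapsed_seconds(struct timeval start, struct timeval end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + 1e-6 * (double)(end.tv_usec - start.tv_usec);
}

/* Payload throughput in Mbit/s: 184 payload bytes (188 minus the
 * 4-byte TS header) are counted per packet, as in the test above. */
static double payload_mbit_per_s(long packets, double seconds)
{
    assert(seconds > 0.0);
    return (184.0 * 8.0 * (double)packets) / seconds / 1e6;
}

/* Packet rate in packets per second. */
static double packets_per_s(long packets, double seconds)
{
    assert(seconds > 0.0);
    return (double)packets / seconds;
}
```

Because the header bytes are excluded, the reported figure is slightly lower than the raw transport stream bitrate that the same packets would occupy on the wire.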

+ 279
- 0
FFdecsa/FFdecsa_test_testcases.h

@@ -0,0 +1,279 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+// TEST DATA
22
+
23
+////////// used as a wrong key
24
+unsigned char test_invalid_key[0x08] = {
25
+    0x0f, 0x1e, 0x2d, 0x3c, 0x4b, 0x5a, 0x69, 0x78
26
+};
27
+
28
+
29
+////////// test 1: odd key
30
+unsigned char test_1_key[0x8] = {
31
+    0x07, 0xe0, 0x1b, 0x02, 0xc9, 0xe0, 0x45, 0xee
32
+};
33
+unsigned char test_1_encrypted[0x100] = {
34
+    0x47, 0x00, 0x00, 0xd0,
35
+    0xde, 0xcf, 0x0a, 0x0d, 0xb2, 0xd7, 0xc4, 0x40, 0xde, 0x5d, 0x63, 0x18, 0x5a, 0x98, 0x17, 0xaa,
36
+    0xc9, 0xbc, 0x27, 0xc6, 0xcb, 0x49, 0x40, 0x48, 0xfd, 0x20, 0xb7, 0x05, 0x5b, 0x27, 0xcb, 0xeb,
37
+    0x9a, 0xf0, 0xac, 0x45, 0x6d, 0x56, 0xf4, 0x7b, 0x6f, 0xa0, 0x57, 0xf3, 0x9b, 0xf7, 0xa2, 0xc7,
38
+    0xd4, 0x68, 0x24, 0x00, 0x2f, 0x28, 0x13, 0x96, 0x94, 0xa8, 0x7c, 0xf4, 0x6f, 0x07, 0x2a, 0x0e,
39
+    0xe8, 0xa1, 0xeb, 0xc7, 0x80, 0xac, 0x1f, 0x79, 0xbf, 0x5d, 0xb6, 0x10, 0x7c, 0x2e, 0x52, 0xe9,
40
+    0x34, 0x2c, 0xa8, 0x39, 0x01, 0x73, 0x04, 0x24, 0xa8, 0x1e, 0xdb, 0x5b, 0xcb, 0x24, 0xf6, 0x31,
41
+    0xab, 0x02, 0x6b, 0xf9, 0xf6, 0xf7, 0xe9, 0x52, 0xad, 0xcf, 0x62, 0x0f, 0x42, 0xf6, 0x66, 0x5d,
42
+    0xc0, 0x86, 0xf2, 0x7b, 0x40, 0x20, 0xa9, 0xbd, 0x1f, 0xfd, 0x16, 0xad, 0x2e, 0x75, 0xa6, 0xa0,
43
+    0x85, 0xf3, 0x9c, 0x31, 0x20, 0x4e, 0xfb, 0x95, 0x61, 0x78, 0xce, 0x10, 0xc1, 0x48, 0x5f, 0xd3,
44
+    0x61, 0x05, 0x12, 0xf4, 0xe2, 0x04, 0xae, 0xe0, 0x86, 0x01, 0x56, 0x55, 0xb1, 0x0f, 0xa6, 0x33,
45
+    0x95, 0x20, 0x92, 0xf0, 0xbe, 0x39, 0x31, 0xe1, 0x2a, 0xf7, 0x93, 0xb4, 0xf7, 0xe4, 0xf1, 0x85,
46
+    0xae, 0x50, 0xf1, 0x63, 0xd4, 0x5d, 0x9c, 0x6c
47
+};
48
+unsigned char test_1_expected[0x100] = {
49
+    0x47, 0x00, 0x00, 0xd0,
50
+    0xaf, 0xbe, 0xfb, 0xef, 0xbe, 0xfb, 0xef, 0xbe, 0xfb, 0xef, 0xbe, 0xfb, 0xe6, 0xb5, 0xad, 0x7c,
51
+    0xf9, 0xf3, 0xe5, 0xb1, 0x6c, 0x7c, 0xf9, 0xf3, 0xe6, 0xb5, 0xad, 0x6b, 0x5f, 0x3e, 0x7c, 0xf9,
52
+    0x6c, 0x5b, 0x1f, 0x3e, 0x7c, 0xf9, 0xad, 0x6b, 0x5a, 0xd7, 0xcf, 0x9f, 0x3e, 0x5b, 0x16, 0xc7,
53
+    0xcf, 0x9f, 0x3e, 0x6b, 0x5a, 0xd6, 0xb5, 0xf3, 0xe7, 0xcf, 0x96, 0xc5, 0xb1, 0xf3, 0xe7, 0xcf,
54
+    0x9a, 0xd6, 0xb5, 0xad, 0x7c, 0xf9, 0xf3, 0xe5, 0xb1, 0x6c, 0x7c, 0xf9, 0xf3, 0xe6, 0xb5, 0xad,
55
+    0x6b, 0x5f, 0x3e, 0x7c, 0xf9, 0x6c, 0x5b, 0x1f, 0x3e, 0x7c, 0xf9, 0xad, 0x6b, 0x5a, 0xd7, 0xcf,
56
+    0x9f, 0x3e, 0x5b, 0x16, 0xc7, 0xcf, 0x9f, 0x3e, 0x6b, 0x5a, 0xd6, 0xb5, 0xf3, 0xe7, 0xcf, 0x96,
57
+    0xc5, 0xb1, 0xf3, 0xe7, 0xcf, 0x9a, 0xd6, 0xb5, 0xad, 0x7c, 0xf9, 0xf3, 0xe5, 0xb1, 0x6c, 0x7c,
58
+    0xf9, 0xf3, 0xe6, 0xb5, 0xad, 0x6b, 0x5f, 0x3e, 0x7c, 0xf9, 0x6c, 0x5b, 0x1f, 0x3e, 0x7c, 0xf9,
59
+    0xad, 0x6b, 0x5a, 0xd7, 0xcf, 0x9f, 0x3e, 0x5b, 0x16, 0xc7, 0xcf, 0x9f, 0x3e, 0x6b, 0x5a, 0xd6,
60
+    0xb5, 0xf3, 0xe7, 0xcf, 0x96, 0xc5, 0xb1, 0xf3, 0xe7, 0xcf, 0x9a, 0xd0, 0x00, 0x00, 0x00, 0x00,
61
+    0xff, 0xfc, 0x44, 0x00, 0x66, 0xb1, 0x11, 0x11
62
+};
63
+unsigned char test_1_expected_stream[0x100] = {
64
+    0xdc, 0x15, 0xde, 0xf1, 0x4a, 0xf1, 0xf8, 0x2c,
65
+    0x75, 0xc8, 0x3a, 0x1f, 0xbf, 0x67, 0x19, 0xe1,
66
+    0xf4, 0x6c, 0x78, 0x99, 0x48, 0xaf, 0xef, 0x94,
67
+    0x71, 0x6b, 0x23, 0x9e, 0x29, 0x69, 0x2d, 0xa1,
68
+    0x8a, 0xbb, 0xf4, 0x16, 0x68, 0xa5, 0x7f, 0x14,
69
+    0xa9, 0x37, 0x24, 0x05, 0x5e, 0xdd, 0xec, 0x4b,
70
+    0xb5, 0xcb, 0x7f, 0x1d, 0xa7, 0x09, 0x2a, 0xce,
71
+    0xc4, 0x30, 0x83, 0xfd, 0xd9, 0x88, 0xa9, 0xf3,
72
+    0x85, 0x9c, 0x38, 0x31, 0x88, 0xac, 0x74, 0x02,
73
+    0x44, 0xdc, 0xb7, 0x81, 0x07, 0xc8, 0x1b, 0x03,
74
+    0x9c, 0x76, 0xbe, 0xe9, 0x4d, 0x3e, 0x19, 0xad,
75
+    0xe1, 0xf1, 0xa5, 0x13, 0xe8, 0xc0, 0x12, 0x57,
76
+    0x68, 0xb1, 0x9c, 0x6c, 0x9f, 0x58, 0x78, 0xee,
77
+    0x4f, 0x5b, 0x33, 0x1e, 0xc6, 0x29, 0xfc, 0x40,
78
+    0x58, 0x22, 0xa2, 0xd8, 0x32, 0xdd, 0x29, 0x4f,
79
+    0x2b, 0xe1, 0xef, 0xe4, 0xbb, 0xf2, 0x60, 0x94,
80
+    0x6c, 0xc5, 0x51, 0xec, 0x35, 0x4c, 0x27, 0xc6,
81
+    0x9d, 0x73, 0xe0, 0xf4, 0x2b, 0xfa, 0x62, 0x12,
82
+    0xcd, 0x44, 0xbe, 0x57, 0xfe, 0x80, 0xe7, 0xa9,
83
+    0x3c, 0x49, 0x42, 0xb6, 0xed, 0x05, 0x57, 0x00,
84
+    0xd2, 0x25, 0x90, 0xb3, 0xe4, 0x65, 0x8f, 0xd6,
85
+    0x4e, 0x0c, 0x73, 0x30, 0x3b, 0x68, 0x48, 0xdd,
86
+// stream ^ sb
87
+//    0x02, 0x48, 0xbd, 0xe9, 0x10, 0x69, 0xef, 0x86,
88
+//    0xbc, 0x74, 0x1d, 0xd9, 0x74, 0x2e, 0x59, 0xa9,
89
+//    0x09, 0x4c, 0xcf, 0x9c, 0x13, 0x88, 0x24, 0x7f,
90
+//    0xeb, 0x9b, 0x8f, 0xdb, 0x44, 0x3f, 0xd9, 0xda,
91
+};
92
+unsigned char test_1_expected_block[0x100] = {
93
+    0xad, 0xf6, 0x46, 0x06, 0xae, 0x92, 0x00, 0x38,
94
+    0x47, 0x9b, 0xa3, 0x22, 0x92, 0x9b, 0xf4, 0xd5,
95
+    0xf0, 0xbf, 0x2a, 0x2d, 0x7f, 0xf4, 0xdd, 0x8c,
96
+    0x0d, 0x2e, 0x22, 0xb0, 0x1b, 0x01, 0xa5, 0x23,
97
+    0x89, 0x40, 0xbc, 0xdb, 0x8f, 0xab, 0x70, 0xb8,
98
+    0x27, 0x88, 0xcf, 0x9a, 0x4f, 0xae, 0xe9, 0x1a,
99
+    0xee, 0xfc, 0x3d, 0x82, 0x92, 0xd8, 0xb5, 0x33,
100
+    0xcb, 0x5e, 0xfe, 0xff, 0xe8, 0xd7, 0x51, 0x45,
101
+    0xa0, 0x17, 0x3b, 0x8c, 0x88, 0x7b, 0xd5, 0x0e,
102
+    0xc1, 0x9c, 0x63, 0x41, 0xf5, 0x5d, 0xaa, 0x8a,
103
+    0x5f, 0x37, 0x5b, 0xce, 0x7f, 0x76, 0xb4, 0x83,
104
+    0x74, 0x8f, 0x37, 0x47, 0x75, 0x6d, 0x2c, 0xca,
105
+    0x5a, 0x40, 0xa5, 0x75, 0x1a, 0x61, 0x81, 0x8d,
106
+    0xe4, 0x87, 0x17, 0xd0, 0x75, 0xee, 0x9a, 0x6b,
107
+    0x82, 0x6e, 0x47, 0x92, 0xd3, 0x32, 0x59, 0x5a,
108
+    0x03, 0x6e, 0x8a, 0x26, 0x7e, 0x0d, 0xf7, 0x7d,
109
+    0xf4, 0x4e, 0x79, 0x49, 0x59, 0x6f, 0x27, 0x2b,
110
+    0x80, 0x8f, 0x9e, 0x5b, 0xd6, 0xc0, 0xb0, 0x0b,
111
+    0xe6, 0x2e, 0xb2, 0xd5, 0x80, 0x10, 0x7f, 0xc1,
112
+    0xbf, 0xae, 0x1f, 0xd9, 0x6d, 0x57, 0x3c, 0x37,
113
+    0x4d, 0x21, 0xe4, 0xc8, 0x85, 0x44, 0xcf, 0xa0,
114
+    0x07, 0x93, 0x18, 0x83, 0xef, 0x35, 0xd4, 0xb1,
115
+    0xff, 0xfc, 0x44, 0x00, 0x66, 0xb1, 0x11, 0x11
116
+};
117
+unsigned char test_1_expected_kb[] = {
118
+    0xEE, 0x45, 0xE0, 0xC9, 0x02, 0x1B, 0xE0, 0x07,
119
+    0x46, 0xA4, 0x1C, 0x26, 0x7B, 0x0C, 0x01, 0xED,
120
+    0x93, 0x99, 0xC3, 0x14, 0xC4, 0x4A, 0x8D, 0x54,
121
+    0x19, 0x82, 0x39, 0xD1, 0x33, 0xB0, 0x33, 0x52,
122
+    0x75, 0x62, 0x80, 0x3A, 0xC8, 0x83, 0x5E, 0x23,
123
+    0xA2, 0x57, 0x0C, 0xC4, 0x2C, 0x2D, 0xD2, 0x98,
124
+    0xA0, 0x6C, 0x77, 0x29, 0x11, 0x42, 0x49, 0xCE,
125
+};
126
+unsigned char test_1_expected_kk[] = {
127
+    0x5e, 0x9d, 0xff, 0x2e, 0xbb, 0xaa, 0xa8, 0xe9,
128
+    0xf6, 0x0e, 0xff, 0x7c, 0xda, 0xce, 0x55, 0x03,
129
+    0xd9, 0xde, 0x79, 0xf5, 0x2c, 0xaf, 0x06, 0xf8,
130
+    0xb2, 0xc9, 0xf8, 0x78, 0x54, 0xf9, 0xd1, 0xe7,
131
+    0xeb, 0xbe, 0xd7, 0xeb, 0x25, 0xe9, 0x17, 0x99,
132
+    0xbf, 0x24, 0xce, 0x2a, 0x73, 0xfe, 0xf9, 0xbc,
133
+    0xd9, 0x55, 0x91, 0xcf, 0xe0, 0xc9, 0xdf, 0x88,
134
+};
135
+
136
+
137
+////////// test 2: even key
138
+unsigned char test_2_key[0x8] = {
139
+    0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00
140
+};
141
+unsigned char test_2_encrypted[0x100] = {
142
+    0x47, 0x00, 0x00, 0x90,
143
+    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
144
+    0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f,
145
+    0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f,
146
+    0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f,
147
+    0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f,
148
+    0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f,
149
+    0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f,
150
+    0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f,
151
+    0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,
152
+    0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9a, 0x9b, 0x9c, 0x9d, 0x9e, 0x9f,
153
+    0xa0, 0xa1, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6, 0xa7, 0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf,
154
+    0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7,
155
+};
156
+unsigned char test_2_expected[0x100] = {
157
+    0x47, 0x00, 0x00, 0x90,
158
+    0x2d, 0x0a, 0x47, 0x20, 0x18, 0x11, 0x9c, 0x8a, 0xd1, 0x2a, 0x65, 0x6b, 0x89, 0xe4, 0x35, 0x2b,
159
+    0xc2, 0xb5, 0x90, 0x61, 0xd1, 0x7e, 0x02, 0xe1, 0x3f, 0x46, 0x70, 0xcf, 0x77, 0x91, 0x2f, 0x22,
160
+    0x93, 0xc1, 0x6c, 0xfe, 0x49, 0xad, 0x7c, 0xc2, 0xaf, 0x86, 0x1b, 0xa3, 0x29, 0xbe, 0xaa, 0x64,
161
+    0xf0, 0x22, 0xb9, 0x5e, 0x98, 0xaa, 0x60, 0xef, 0xdf, 0xd6, 0x44, 0x77, 0xe6, 0xbf, 0xbb, 0x94,
162
+    0xb2, 0x0a, 0x63, 0x0e, 0x5c, 0xf2, 0xac, 0xb4, 0x49, 0xcc, 0x9e, 0x4f, 0x94, 0x4c, 0x30, 0x12,
163
+    0xe8, 0x55, 0xc2, 0x44, 0xa4, 0x52, 0xcb, 0x61, 0x81, 0xc9, 0xb6, 0xa6, 0x6b, 0xef, 0xaf, 0xa6,
164
+    0x71, 0x1d, 0x7b, 0x58, 0x2f, 0xfa, 0xd1, 0x0c, 0x07, 0x9d, 0x1f, 0x35, 0x87, 0xbe, 0x02, 0x9f,
165
+    0x20, 0xc6, 0x60, 0x8f, 0x1c, 0x30, 0x0f, 0x96, 0xd0, 0x71, 0xd6, 0x51, 0x10, 0xdf, 0x5b, 0xf6,
166
+    0x44, 0x2f, 0x80, 0x28, 0xb7, 0xec, 0x23, 0x59, 0x4b, 0x94, 0x0b, 0x9a, 0x74, 0xa1, 0x1f, 0xf7,
167
+    0x9e, 0x76, 0xb4, 0xdf, 0xbb, 0x3c, 0x8c, 0x88, 0x97, 0x22, 0x56, 0x73, 0x16, 0x05, 0xac, 0xf9,
168
+    0x4f, 0x77, 0x9d, 0x38, 0xa0, 0x6b, 0x05, 0xd2, 0xe6, 0x15, 0x01, 0xb1, 0x5c, 0xc9, 0x62, 0xa9,
169
+    0x9b, 0x1a, 0x6a, 0x1a, 0xcf, 0xe6, 0xa8, 0xba,
170
+};
171
+
172
+
173
+////////// test 3: even key
174
+unsigned char test_3_key[0x8] = {
175
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
176
+};
177
+unsigned char test_3_encrypted[0x100] = {
178
+    0x47, 0x00, 0x00, 0x90,
179
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
180
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
181
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
182
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
183
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
184
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
185
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
186
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
187
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
188
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
189
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
190
+    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
191
+
192
+};
193
+unsigned char test_3_expected[0x100] = {
194
+    0x47, 0x00, 0x00, 0x90,
195
+    0xfe, 0x91, 0xa7, 0x2f, 0xbf, 0xb0, 0x6a, 0x54, 0xc1, 0xe4, 0x33, 0x27, 0x18, 0xd5, 0x9c, 0x43,
196
+    0xea, 0xaa, 0x6b, 0x38, 0x5c, 0xe7, 0xae, 0xc9, 0xac, 0xec, 0xef, 0xc3, 0x51, 0x7d, 0x53, 0x47,
197
+    0xa0, 0xa7, 0x6d, 0x73, 0x8a, 0x9d, 0x16, 0x7d, 0x05, 0x2d, 0xd6, 0x6b, 0xf4, 0x8d, 0x4b, 0x81,
198
+    0x98, 0x2f, 0x46, 0xa5, 0x34, 0x84, 0xf3, 0x70, 0xa4, 0xe9, 0x04, 0x84, 0x7b, 0x87, 0x79, 0x3c,
199
+    0x01, 0x25, 0xb5, 0xfc, 0x3d, 0xd0, 0x25, 0xea, 0x2f, 0x91, 0xf0, 0x3f, 0x7f, 0xd4, 0x8e, 0x1e,
200
+    0x36, 0x83, 0x22, 0x91, 0x57, 0x92, 0x36, 0x0b, 0x44, 0xa5, 0xcc, 0x5e, 0xef, 0x44, 0x3e, 0xf8,
201
+    0xe9, 0x7b, 0x5e, 0x0c, 0xea, 0xb2, 0x50, 0x39, 0xb7, 0xea, 0xc4, 0xfb, 0xe4, 0x37, 0xf8, 0x85,
202
+    0xc2, 0xdc, 0x01, 0x98, 0x01, 0x2a, 0x44, 0xd3, 0x75, 0x10, 0x38, 0xf4, 0x85, 0x3e, 0xc9, 0xf7,
203
+    0xe7, 0xe4, 0xec, 0x40, 0x3d, 0x8f, 0xa5, 0xd2, 0x8a, 0xca, 0x62, 0x03, 0x3f, 0x65, 0x28, 0x8d,
204
+    0xf5, 0x56, 0xa7, 0xea, 0xd1, 0x0d, 0x70, 0x82, 0xbc, 0x90, 0x59, 0xf8, 0x3e, 0x08, 0xc9, 0xe1,
205
+    0x97, 0xef, 0x82, 0x43, 0x35, 0x41, 0x3e, 0x7f, 0x00, 0x96, 0x3f, 0x90, 0xe5, 0x1e, 0x96, 0xba,
206
+    0xce, 0x6d, 0xd2, 0x54, 0xce, 0x84, 0x76, 0x3c
207
+};
208
+
209
+
210
+////////// even key, only 80 (0x50) bytes of payload (10 groups of 8 bytes + 0 byte residue)
211
+unsigned char test_p_10_0_key[0x8] = {
212
+    0x2d, 0x11, 0x5f, 0x9d, 0x29, 0xbf, 0x7f, 0x67
213
+};
214
+unsigned char test_p_10_0_encrypted[0x100] = {
215
+  0x47, 0x00, 0x7a, 0xbe,
216
+  0x67, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
217
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
218
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
219
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
220
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
221
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
222
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x71, 0xa5, 0x7b, 0x8f, 0xf9, 0x87, 0xcb, 0xac,
223
+  0xea, 0x08, 0x0c, 0x02, 0x87, 0x7b, 0xad, 0x10, 0x40, 0x28, 0x8e, 0xd4, 0x4e, 0x62, 0xc7, 0x74,
224
+  0xd6, 0xbb, 0x3a, 0xaa, 0xb0, 0x7b, 0x70, 0xbe, 0x06, 0xc9, 0xdc, 0x07, 0xd2, 0x2d, 0xab, 0x2d,
225
+  0xe2, 0xc6, 0x36, 0xa6, 0xda, 0x64, 0x61, 0x15, 0xd1, 0x6a, 0x40, 0xc0, 0xa9, 0xfb, 0x3f, 0xb2,
226
+  0x6d, 0xa5, 0x59, 0xae, 0x57, 0x88, 0x6b, 0x0e, 0x00, 0xae, 0xce, 0x64, 0xee, 0xfd, 0xb1, 0x7f,
227
+  0x78, 0x9c, 0x12, 0x42, 0xbe, 0x30, 0x8a, 0xa3 
228
+};
229
+unsigned char test_p_10_0_expected[0x100] = {
230
+  0x47, 0x00, 0x7a, 0xbe,
231
+  0x67, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
232
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
233
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
234
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
235
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
236
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
237
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xa7, 0xca, 0x32, 0xaf, 0x2e, 0x6a, 0xea, 0x05,
238
+  0x39, 0x33, 0x67, 0x5d, 0xa3, 0x61, 0x0f, 0x34, 0x40, 0x6c, 0x1a, 0xb3, 0xee, 0x54, 0x64, 0xd5,
239
+  0xa3, 0x01, 0x95, 0x87, 0x9d, 0x3d, 0x38, 0xc5, 0x82, 0x8b, 0x8d, 0xab, 0xad, 0x93, 0x0f, 0xe8,
240
+  0xf9, 0xbd, 0x52, 0x98, 0x59, 0xb2, 0x41, 0x95, 0xcd, 0xae, 0x9b, 0x3e, 0xdf, 0xdb, 0x14, 0x9b,
241
+  0xa9, 0x22, 0x0d, 0x2d, 0x61, 0xf5, 0xf2, 0x52, 0x83, 0x20, 0xae, 0xb8, 0x83, 0x52, 0x02, 0xee,
242
+  0xbd, 0xd2, 0x94, 0x6c, 0x27, 0x58, 0x55, 0xd0
243
+};
244
+
245
+
246
+////////// even key, only 14 (0x0e) bytes of payload (1 group of 8 bytes + 6 byte residue)
247
+unsigned char test_p_1_6_key[0x8] = {
248
+    0x2d, 0x11, 0x5f, 0x9d, 0x29, 0xbf, 0x7f, 0x67
249
+};
250
+unsigned char test_p_1_6_encrypted[0x100] = {
251
+  0x47, 0x00, 0x7a, 0xb7,
252
+  0xa9, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
253
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
254
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
255
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
256
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
257
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
258
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
259
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
260
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
261
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
262
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xc0, 0x5e, 0xfb, 0xc8, 0x4a, 0x63,
263
+  0xe3, 0x3c, 0x11, 0xd9, 0xe0, 0x75, 0x8e, 0xf2 
264
+};
265
+unsigned char test_p_1_6_expected[0x100] = {
266
+  0x47, 0x00, 0x7a, 0xb7,
267
+  0xa9, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
268
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
269
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
270
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
271
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
272
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
273
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
274
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
275
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
276
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
277
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x5a, 0x2c, 0xee, 0xb3, 0xde, 0x92,
278
+  0xe7, 0xa6, 0x6c, 0xaa, 0x99, 0x84, 0xe4, 0x00 
279
+};
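The test vectors above encode both the key parity and the payload size in the packet header: the fourth byte carries the scrambling control bits, and packets with an adaptation field have a shorter payload. The following is a minimal sketch, assuming the standard MPEG TS header layout; `ts_scrambling_control` and `ts_payload_length` are hypothetical helper names, not FFdecsa functions.

```c
#include <assert.h>
#include <stddef.h>

#define TS_PACKET_SIZE 188

/* Scrambling control: bits 7..6 of the fourth header byte.
 * 0 = clear, 2 = even key, 3 = odd key. */
static int ts_scrambling_control(const unsigned char *pkt)
{
    assert(pkt != NULL && pkt[0] == 0x47);   /* 0x47 is the sync byte */
    return (pkt[3] >> 6) & 0x03;
}

/* Number of payload bytes after the header and the optional
 * adaptation field. */
static int ts_payload_length(const unsigned char *pkt)
{
    int afc;
    int offset = 4;                          /* fixed header size */

    assert(pkt != NULL && pkt[0] == 0x47);
    afc = (pkt[3] >> 4) & 0x03;              /* adaptation_field_control */
    if (!(afc & 0x01))
        return 0;                            /* packet has no payload */
    if (afc & 0x02)
        offset += 1 + pkt[4];                /* length byte + field body */
    return offset >= TS_PACKET_SIZE ? 0 : TS_PACKET_SIZE - offset;
}
```

Applied to the first five bytes of `test_p_1_6_encrypted` (`0x47, 0x00, 0x7a, 0xb7, 0xa9`), the payload length comes out as 14, which matches the "1 group of 8 bytes + 6 byte residue" description; for `test_1_encrypted` it is the full 184 bytes.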

+ 59
- 0
FFdecsa/Makefile

@@ -0,0 +1,59 @@
1
+##### compiling with g++ gives a little more speed
2
+#COMPILER=gcc
3
+#COMPILER=g++
4
+
5
+###there are two functions which apparently don't want to be inlined
6
+#FLAGS=-O3 -march=athlon-xp -fexpensive-optimizations -funroll-loops -finline-limit=6000000 --param max-unrolled-insns=500
7
+#FLAGS=-O3 -march=athlon-xp -fexpensive-optimizations -funroll-loops --param max-unrolled-insns=500
8
+#FLAGS=-O3 -march=pentium3 -fexpensive-optimizations -funroll-loops
9
+
10
+###icc crashes for unknown reasons
11
+#COMPILER=/opt/intel_cc_80/bin/icc
12
+#FLAGS=-O3 -march=pentiumiii
13
+
14
+#FLAGS += -g
15
+#FLAGS += -fno-alias
16
+#FLAGS += -vec_report3
17
+#FLAGS += -Wall -Winline
18
+#FLAGS += -fomit-frame-pointer 
19
+#FLAGS += -pg
20
+
21
+COMPILER ?= g++
22
+FLAGS    ?= -Wall -fPIC -O3 -march=pentium -mmmx -fomit-frame-pointer -fexpensive-optimizations -funroll-loops
23
+
24
+H_FILES = FFdecsa.h parallel_generic.h parallel_std_def.h fftable.h \
25
+          parallel_032_4char.h \
26
+          parallel_032_int.h \
27
+          parallel_064_2int.h \
28
+          parallel_064_8charA.h \
29
+          parallel_064_8char.h \
30
+          parallel_064_long.h \
31
+          parallel_064_mmx.h \
32
+          parallel_128_16charA.h \
33
+          parallel_128_16char.h \
34
+          parallel_128_2long.h \
35
+          parallel_128_2mmx.h \
36
+          parallel_128_4int.h \
37
+          parallel_128_sse2.h \
38
+          parallel_128_sse.h
39
+
40
+all: FFdecsa.o FFdecsa_test.done
41
+
42
+%.o: %.c
43
+	$(COMPILER) $(FLAGS) -DPARALLEL_MODE=$(PARALLEL_MODE) -c $<
44
+
45
+FFdecsa_test:	FFdecsa_test.o FFdecsa.o
46
+	$(COMPILER) $(FLAGS) -o FFdecsa_test FFdecsa_test.o FFdecsa.o
47
+
48
+FFdecsa_test.o: FFdecsa_test.c FFdecsa.h FFdecsa_test_testcases.h
49
+FFdecsa.o: 	FFdecsa.c stream.c $(H_FILES)
50
+
51
+FFdecsa_test.done: FFdecsa_test
52
+	@./FFdecsa_test
53
+	@touch FFdecsa_test.done
54
+
55
+clean:
56
+	@rm -f FFdecsa_test FFdecsa_test.done FFdecsa_test.o FFdecsa.o
57
+
58
+test:	FFdecsa_test
59
+	sync;usleep 200000;nice --19 ./FFdecsa_test

+ 50
- 0
FFdecsa/README

@@ -0,0 +1,50 @@
1
+-------
2
+FFdecsa
3
+-------
4
+version 1.0
5
+Copyright 2003-2004  fatih89r
6
+released under GPL
7
+
8
+
9
+FFdecsa is a fast implementation of a CSA decryption algorithm for MPEG
10
+TS packets. It is shockingly fast, more than 800% the speed of the
11
+fastest implementation I can find around. (read the docs to know what FF
12
+stands for)
13
+
14
+On an AthlonXP 2400 (2000MHz) it achieves 165Mbit/s; the previous record
15
+was around 20Mbit/s.
16
+
17
+This means that:
18
+- decrypting an 8Mbit/s stream takes 5% of CPU instead of 40%
19
+- decrypting a full transponder (with all its channels or with a big
20
+  HDTV stream) carrying 38Mbit/s takes 23% of CPU instead of 190%
21
+  (>100%, so undecryptable in real time)
22
+- a very slow processor can decrypt one channel with no problems
23
+- offline decoding of one hour of a 5Mbit/s channel takes less than
24
+  two minutes (30x faster than realtime)
25
+- offline decoding will work at more than 20MB/s (megabytes/s),
26
+  nearly as fast as a file copy
27
+
28
+The docs directory contains useful stuff:
29
+
30
+  FAQ.txt
31
+    to know something more about this software
32
+
33
+  how_to_compile.txt
34
+    if you want to compile this code (and get optimal speed)
35
+
36
+  how_to_use.txt
37
+    if you want to use this code
38
+
39
+  technical_background.txt
40
+    if you want to understand how this code works or you want to
41
+    modify/improve it
42
+
43
+  how_to_understand.txt
44
+    if you want to understand the code to make modifications
45
+
46
+  how_to_release.txt
47
+    if you want to release modified versions of the code
48
+
49
+
50
+fatih89r
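The CPU figures quoted in the README follow from dividing the stream bitrate by the decryption throughput. Here is a minimal sketch of that arithmetic, assuming load scales linearly with bitrate; the function names are illustrative only.

```c
#include <assert.h>

/* Percentage of one CPU needed to decrypt a stream of `stream_mbit`
 * Mbit/s with an implementation that sustains `decsa_mbit` Mbit/s. */
static double cpu_load_percent(double stream_mbit, double decsa_mbit)
{
    assert(decsa_mbit > 0.0);
    return 100.0 * stream_mbit / decsa_mbit;
}

/* Seconds needed to decode `duration_s` seconds of a recorded stream
 * offline, at the same sustained throughput. */
static double offline_seconds(double stream_mbit, double duration_s,
                              double decsa_mbit)
{
    assert(decsa_mbit > 0.0);
    return duration_s * stream_mbit / decsa_mbit;
}
```

With the README's numbers, a 38 Mbit/s transponder against 165 Mbit/s comes out at roughly 23% of a CPU, while the same stream against 20 Mbit/s gives 190%. One hour of a 5 Mbit/s channel at 165 Mbit/s takes about 109 seconds, consistent with "less than two minutes".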

+ 77
- 0
FFdecsa/docs/FAQ.txt

@@ -0,0 +1,77 @@
1
+-------
2
+FFdecsa
3
+-------
4
+
5
+FFdecsa is a fast implementation of the CSA decryption algorithm for MPEG
6
+TS packets.
7
+
8
+Q: What does FF stand for?
9
+A: FFdecsa means "Fucking Fast decsa".
10
+
11
+Q: Why would you use such a rude name?
12
+A: Because this code is fucking fast, more than 800% the speed of the best
13
+   implementation I'm able to find around at the moment.
14
+
15
+Q: How is that possible? Are all other programmers stupid?
16
+A: No, they just tried to save a cycle or two tweaking a fundamentally wrong
17
+   implementation. The algorithm has to be implemented in a totally different
18
+   way to achieve good speed.
19
+
20
+Q: Do you use multimedia instructions?
21
+A: I use every trick I could come up with, including multimedia instructions.
22
+   They are not fundamental in achieving speed; a version without them runs
23
+   at 6x the speed of the best implementation around (which uses MMX).
24
+
25
+Q: So how did you do that?
26
+A: By using a different approach for the implementation. This code is not
27
+   exploiting some new CSA vulnerability, it is just doing the same
28
+   calculations better. Think about replacing bubble sort with quick sort.
29
+
30
+Q: You're joking, it's impossible to gain so much speed.
31
+A: Speed tests are available, technical documentation is available, source
32
+   code is available. Try it yourself.
33
+   If you want details, these are some of the documented tricks I used
34
+   (more details in the docs directory):
35
+    TRICK NUMBER 0: emulate the hardware
36
+    TRICK NUMBER 1: virtual shift registers
37
+    TRICK NUMBER 2: parallel bitslice
38
+    TRICK NUMBER 3: multimedia instructions
39
+    TRICK NUMBER 4: parallel byteslice
40
+    TRICK NUMBER 5: efficient bit permutation
41
+    TRICK NUMBER 6: efficient normal<->slice conversion
42
+    TRICK NUMBER 7: try hard to process packets together
43
+    TRICK NUMBER 8: try to avoid doing the same thing many times
44
+    TRICK NUMBER 9: compiler
45
+    TRICK NUMBER a: a lot of brain work
46
+
47
+Q: How can this code be useful?
48
+A: You can use this code in place of the old slow implementations and save a
49
+   lot of CPU power.
50
+
51
+Q: Just that?
52
+A: Well, new applications are possible.
53
+   Decrypting a whole transponder is easily doable now. Well, a $50 CPU can
54
+   decrypt four transponders at the same time if you have four DVB boards (but
55
+   I couldn't test that).
56
+
57
+Q: You're cheating, this code is fake, I don't believe one word.
58
+A: Go away. This is technical stuff for people with brains.
59
+
60
+Q: This code is great, may I distribute your code in original or modified
61
+   form?
62
+A: Only if you respect the license.
63
+
64
+Q: May I use your code in my player/library/plugin...?
65
+A: Again, you have to respect the license.
66
+
67
+Q: Are you an extraterrestrial programmer?
68
+A: No, just a Turkish guy with a PC to play with :-)
69
+
70
+Q: Why did you spend your time doing this?
71
+A: Because I thought that my approach was doable and I was sure it would
72
+   have been much faster, so I had to implement it to confirm I was right.
73
+   I got 8x the speed and that's enough to be proud of it. And I could not
74
+   just keep the code for myself only.
75
+
76
+Q: What is the answer to the meaning of the universe?
77
+A: 42,43,71,5f,65,85,f6,76,0d,13,28,96,...
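The "parallel bitslice" trick named in the FAQ can be illustrated with a toy example. The sketch below is not taken from FFdecsa: it uses a made-up boolean function and a deliberately naive slicing loop, only to show how one machine word can hold the same bit from 32 independent inputs so that a single logic instruction works on all of them at once.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 32

/* Gather bit `bit` of each of the 32 input bytes into one word:
 * input i ends up in bit position i of the result. */
static uint32_t slice_bit(const uint8_t in[LANES], int bit)
{
    uint32_t w = 0;
    int i;

    assert(bit >= 0 && bit < 8);
    for (i = 0; i < LANES; i++)
        w |= (uint32_t)((in[i] >> bit) & 1u) << i;
    return w;
}

/* Scalar reference: f(x) = (bit0 AND bit1) XOR bit2 of one byte. */
static unsigned f_scalar(uint8_t x)
{
    return ((x & 1u) & ((x >> 1) & 1u)) ^ ((x >> 2) & 1u);
}

/* Bitsliced version: the same function evaluated for all 32 inputs
 * with one AND and one XOR on whole words. */
static uint32_t f_bitsliced(const uint8_t in[LANES])
{
    uint32_t b0 = slice_bit(in, 0);
    uint32_t b1 = slice_bit(in, 1);
    uint32_t b2 = slice_bit(in, 2);

    return (b0 & b1) ^ b2;
}
```

In this toy version the conversion into sliced form costs far more than the computation itself, which is presumably why the FAQ lists efficient normal<->slice conversion as a separate trick.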

+ 114
- 0
FFdecsa/docs/how_to_compile.txt

@@ -0,0 +1,114 @@
1
+-------
2
+FFdecsa
3
+-------
4
+
5
+Compiling is as easy as running a make command, if you have gcc and are
6
+using a little endian machine. 64 bit machines have not been tested but
7
+may work with little or no changes; big endian machines will certainly
8
+give incorrect results (read the technical_background.txt to know where
9
+the problem is).
10
+
11
+Before compiling you could edit the Makefile to tweak compiler flags for
12
+optimal performance. If you want to play with different bit-grouping
13
+strategies you have to edit FFdecsa_DBG.c and change the "our choice"
14
+definition. This is highly critical for performance.
15
+
16
+After compilation run the FFdecsa_test application. It will test correct
17
+decryption and print the measured speed (use "nice --19 ./FFdecsa_test"
18
+on an idle machine for better results). Or just use "make test".
19
+
20
+gcc >=3.3.3 is highly recommended. Older versions could give performance
21
+problems.
22
+
23
+icc is currently unusable. In the initial phases of development of
24
+FFdecsa icc was able to compile the code and gave interesting speed
25
+results when using the 8charA grouping mode (arrays of 8 characters are
26
+automatically manipulated through MMX instructions). At some point the
27
+code began to work incorrectly because of a compiler bug (but I found a
28
+workaround). Then, the performance dropped with no reason; I found a
29
+workaround by adding an unused variable (alignment problem, grep for icc
30
+in the code to see where it happens). Then, with the introduction of
31
+group modes based on intrinsics, gcc was finally able to go beyond the
32
+speed record originally set by icc. Additional code tweaks added more
33
+speed to gcc, while icc started to segfault on compilation (both versions
34
+7 and 8). In conclusion, icc is bugged and this code is too hard for it.
35
+gcc on the other hand is great. I tried to inspect generated assembler
36
+to find weak spots, and the generated code is very good indeed.
37
+
38
+Note: the code can be compiled with gcc or g++. g++ is 3% faster for
39
+some reason.
40
+
41
+You should not get any errors or warnings. I only get two "inlining
42
+failed" warnings on two functions I asked to be inlined but gcc doesn't
43
+want to inline.
44
+
45
+The build process creates additional temp files by running grep
46
+commands. This is how debugging output is handled. All the lines
47
+containing DBG are removed and the temp file is compiled (so the line
48
+numbers change between temp and original files). Don't edit the temp
49
+files, they will be overwritten. If you don't remove the DBG lines (for
50
+example, by changing "grep -v DBG" into "grep -v aaDBG" in Makefile) a
51
+lot of output will be generated. This is useful to understand what's
52
+wrong when the FFdecsa_test is failing. I included a reference "known
53
+good" output in the debug_output directory. Extra debug output is
54
+commented out in the code.
55
+
56
+The debug output functionality could be... bugged. This is because I
57
+tested everything using hard coded int grouping mode and then
58
+generalized the debug output to abstract grouping modes. A bug where 4
59
+bytes are printed instead of 8 could be present somewhere. I think it
60
+isn't, but you've been warned.
61
+
62
+This code was only tried on Linux.
63
+It should work on Windows or other platforms, but you may encounter
64
+problems related to the compiler quality. If you want to try, begin with
65
+the int grouping mode. It is only 30% slower than the best (MMX) and it
66
+should be easily portable because no intrinsics are used. I'm
67
+particularly interested in hearing what kind of performance can be
68
+obtained on x86_64 processors in int, long long int, mmx, 2mmx, sse
69
+modes.
70
+
71
+
72
+As a reference, here are the results I get on an Athlon XP 2400+ (this
73
+processor runs at 2000MHz); other processors belonging to the Athlon XP
74
+architecture, including Durons, should have the same speed per MHz.
75
+Cache size and bus speed don't matter.
76
+
77
+CPU: AMD Athlon XP 2400+
78
+
79
+Compiler: g++ (gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7))
80
+
81
+Flags: -O3 -march=athlon-xp -fexpensive-optimizations -funroll-loops
82
+       --param max-unrolled-insns=500
83
+
84
+grouping mode           speed (Mbit/s)    notes
85
+---------------------------------------------------------------------
86
+PARALLEL_32_4CHAR            14
87
+PARALLEL_32_4CHARA           12
88
+PARALLEL_32_INT             125           very good and very portable
89
+PARALLEL_64_8CHAR            17
90
+PARALLEL_64_8CHARA           15           needs a vectorizing compiler
91
+PARALLEL_64_2INT             75           x86 has too few registers
92
+PARALLEL_64_LONG             97           try this on x86_64
93
+PARALLEL_64_MMX             165           the best
94
+PARALLEL_128_16CHAR           6
95
+PARALLEL_128_16CHARA          7
96
+PARALLEL_128_4INT            69
97
+PARALLEL_128_2LONG           52
98
+PARALLEL_128_2MMX            36           slower than expected
99
+PARALLEL_128_SSE            156           just slower than 64_MMX
100
+
101
+Best speeds are obtained with native data types: int, mmx, sse (this
102
+could be a compiler artifact).
103
+
104
+64 bit processors should try 64_LONG.
105
+
106
+Vectorizing compilers should like *CHARA.
107
+
108
+64_MMX is faster than 128_SSE on the Athlon; perhaps SSE instructions are
109
+internally split into 64 bit chunks. Could be different on x86_64 or
110
+Intel processors.
111
+
112
+128_SSE has a 64 bit (MMX) batch type because SSE has no shifting
113
+instructions; they are only available in SSE2. As the Athlon XP doesn't
114
+support SSE2, I couldn't experiment with that.

FFdecsa/docs/how_to_release.txt (+21 -0)

@@ -0,0 +1,21 @@
1
+-------
2
+FFdecsa
3
+-------
4
+
5
+Please use the name of the release you're basing on as a base name and
6
+add your suffix.
7
+
8
+For example if john modifies
9
+  FFdecsa-1.0.0
10
+he should release
11
+  FFdecsa-1.0.0-john_0.3
12
+or
13
+  FFdecsa-1.0.0-john_0.4
14
+
15
+If paul modifies john's version the correct name would be like
16
+  FFdecsa-1.0.0-john_0.4-paul_0.1
17
+
18
+This is to avoid many different versions with random version numbers, as
19
+development is not centralized.
20
+
21
+Thank you.

FFdecsa/docs/how_to_understand.txt (+15 -0)

@@ -0,0 +1,15 @@
1
+-------
2
+FFdecsa
3
+-------
4
+
5
+First, you need to know how decsa works; study the source of a classical
6
+implementation. Then you have to understand how things are done in
7
+slicing mode. Read all the documentation and have a working classical
8
+implementation to compare partial results. There are comments spread
9
+around the code. Some things are difficult to understand without paper
10
+notes; for example the matrix transpositions and meaning of array
11
+indices.
12
+
13
+Sorry, it is hard to understand and modify ...
14
+
15
+... but it was harder to design and implement!!!

FFdecsa/docs/how_to_use.txt (+239 -0)

@@ -0,0 +1,239 @@
1
+-------
2
+FFdecsa
3
+-------
4
+
5
+This code is able to decrypt MPEG TS packets with the CSA algorithm. To
6
+achieve high speed, the decryption core works on many packets at the
7
+same time, so the interface is more complicated than usual decsa
8
+implementations.
9
+
10
+The FFdecsa.h file defines the external interface of this code.
11
+
12
+Basically:
13
+
14
+1) you use get_suggested_cluster_size to know the optimal number of
15
+packets you have to pass for decryption
16
+
17
+2) you use set_control_words to set the decryption keys
18
+
19
+3) you use decrypt_packets to do the actual decryption
20
+
21
+You don't need to always use set_control_words before decrypt_packets,
22
+if the keys haven't changed.
23
+
24
+
25
+The decrypt_packets function call decrypts many packets at the same
26
+time. The interface is complicated because the only design goal was
27
+speed, so it implements zero-copying of packets, out-of-order decryption
28
+and optimal packet aggregation for better parallelism. This part is the
29
+most difficult to understand.
30
+
31
+--- HOW TO USE int decrypt_packets(unsigned char **cluster); ---
32
+
33
+PARAMETERS
34
+  cluster points to an array of pointers, representing zero or more
35
+  ranges. Every range has a start and end pointer; a start pointer==NULL
36
+  terminates the array.
37
+  So, an array of pointers has this content:
38
+    start_of_buffer_1, end_of_buffer_1, ... start_of_buffer_N,
39
+    end_of_buffer_N, NULL
40
+  example:
41
+    0x12340000, 0x123400bc, 0x56780a00, 0x56780b78, NULL
+  has two ranges (0x12340000 - 0x123400bc and 0x56780a00 - 0x56780b78),
+  for a total of three packets (starting at 0x12340000, 0x56780a00,
+  0x56780abc)
45
+RETURNS
46
+  How many packets can now be consumed by the caller; this is always >=
47
+  1, unless the cluster contained zero packets (in that case it's
48
+  obviously zero).
49
+MODIFIES
50
+  The cluster is modified to try to exclude packets which shouldn't be
51
+  submitted again for decryption (because just decrypted or originally
52
+  not crypted). "Try to exclude" because the returned array will never
53
+  be bigger than what was passed, so if you passed only a range and some
54
+  packets in the middle were decrypted making "holes" into the range,
55
+  the range would have to be split into several ranges, and that will
56
+  not be done. If you want a strict description of what has to be passed
57
+  again to decrypt_packets, you have to use ranges with only one packet
58
+  inside. Note that the first packet will certainly be eliminated from
59
+  the returned cluster (see also RETURNS).
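The cluster layout described under PARAMETERS can be sketched in C as follows. This is only an illustration: cluster_packet_count and TS_PACKET_SIZE are hypothetical names that are not part of FFdecsa.h, and the sketch assumes every range covers a whole number of 188-byte packets.

```c
#include <assert.h>
#include <stddef.h>

#define TS_PACKET_SIZE 188 /* size of one MPEG TS packet in bytes */

/* Count the packets described by a cluster: an array of
 * (start, end) pointer pairs terminated by a NULL start pointer.
 * Hypothetical helper, not part of the FFdecsa interface. */
static size_t cluster_packet_count(unsigned char **cluster)
{
  size_t total = 0;
  size_t i;
  for (i = 0; cluster[2 * i] != NULL; i++) {
    size_t len = (size_t)(cluster[2 * i + 1] - cluster[2 * i]);
    assert(len % TS_PACKET_SIZE == 0); /* ranges hold whole packets */
    total += len / TS_PACKET_SIZE;
  }
  return total;
}
```

With one range of 188 bytes and one of 376 bytes the helper reports three packets, which matches the example above. An array whose first entry is NULL describes zero packets.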
60
+
61
+You can now read the detailed description of operation or just skip to
62
+the API examples.
63
+
64
+
65
+---------------------------------
66
+DETAILED DESCRIPTION OF OPERATION
67
+---------------------------------
68
+  consider a sequence of packets like this:
69
+   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 ...
70
+   E  E  E  E  E  E  E  E  E  E  E  O  E  O  E  O  O  O  O  O  O  O  O  O  O  c  O  O  O  O  O  O  O  O  O  O  O ...
71
+  where
72
+   E = encrypted_even,
73
+   O = encrypted_odd,
74
+   e = clear_was_encrypted_even,
75
+   o = clear_was_encrypted_odd,
76
+   c = clear
77
+  and suppose the suggested cluster size is 10 (this could be for a function with internal parallelism 8)
78
+
79
+  1) we define the cluster to include packets 0-9 and
80
+  call decrypt_packets
81
+  a possible result is that the function call
82
+  - returns 8 (8 packets available)
83
+  - the buffer contains now this
84
+  -----------------------------
85
+   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 ...
86
+   e  e  e  e  e  e  e  e  E  E  E  O  E  O  E  O  O  O  O  O  O  O  O  O  O  c  O  O  O  O  O  O  O  O  O  O  O ...
87
+                          -----
88
+  - the modified cluster covers 8-9 [continue reading, but then see note 1 below]
89
+  so, we can use the first 8 packets of the original cluster (0-7)
90
+
91
+  2) now, we define cluster over 8-17 and call decrypt_packets
92
+  a possible result is:
93
+  - returns 3 (3 packets available)
94
+  - the buffer contains now this (!!!)
95
+                          -----------------------------
96
+   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 ...
97
+   e  e  e  e  e  e  e  e  e  e  e  O  e  O  e  O  O  O  O  O  O  O  O  O  O  c  O  O  O  O  O  O  O  O  O  O  O ...
98
+                                   --    --    --------
99
+  - the modified cluster covers 11-11,13-13,15-17 [continue reading, but then see note 1 below]
100
+  so, we can use the first 3 packets of the original cluster (8-10)
101
+
102
+  3) now, we define cluster over 11-20 and call decrypt_packets (defining a cluster 11-11,13-13,15-22 would be better)
103
+  a possible result is:
104
+  - returns 10 (10 packets available)
105
+  - the buffer contains now this
106
+                                   -----------------------------
107
+   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 ...
108
+   e  e  e  e  e  e  e  e  e  e  e  o  e  o  e  o  o  o  o  o  o  O  O  O  O  c  O  O  O  O  O  O  O  O  O  O  O ...
109
+
110
+  - the modified cluster is empty
111
+  so, we can use the first 10 packets of the original cluster (11-20)
112
+  What happened is that the second call decrypted packets 12 and 14 but they were
113
+  not made available because packet 11 was still encrypted,
114
+  the third call decrypted 11,13,15-20 and included 12 and 14 as available too.
115
+
116
+  4) now, we define cluster over 21-30 and call decrypt_packets
117
+  a possible result is:
118
+  - returns 9 (9 packets available)
119
+  - the buffer contains now this
120
+                                                                 -----------------------------
121
+   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 ...
122
+   e  e  e  e  e  e  e  e  e  e  e  o  e  o  e  o  o  o  o  o  o  o  o  o  o  c  o  o  o  o  O  O  O  O  O  O  O ...
123
+                                                                                            --
124
+  - the modified cluster covers 30-30
125
+  so, we can use the first 9 packets of the original cluster (21-29)
126
+  What happened is that packet 25 could be skipped because it is in clear.
127
+
128
+  Note that the suggested cluster size (10) is higher than the maximum number
129
+  of packets that can be really decrypted (8), but we are able to skip 12 and 14
130
+  in step 3) and run the decryption on a full 8 packets group.
131
+  In the same way, we were able to skip 25 in step 4).
132
+  There are three kinds of "free" packets we can skip:
133
+  - packets decrypted in a previous call (as 12 and 14)
134
+  - packets already in clear (as 25)
135
+  - packets with a payload of less than 8 bytes (clear==encrypted!)
136
+
137
+  Note also that we could have defined a better cluster in step 3
138
+  (11-11,13-13,15-22), using what step 2 had returned. The risk of not
139
+  having 8 packets to decrypt would have been smaller (consider the case
140
+  where 19 and 20 were "c").
141
+
142
+  Final considerations:
143
+  - you can use a bigger or smaller cluster than the suggested number of packets
144
+  - every call to decrypt_packets has a *fixed* CPU cost, so you should try to
145
+    not run it with a few packets, when possible
146
+  - decrypt_packets can't decrypt even and odd at the same time; it guarantees
147
+    that the first packet will be decrypted and tries to decrypt as many packets
148
+    as possible
149
+  - clear packets in the middle of encrypted packets don't happen in the real world,
150
+    but E,E,E,O,E,O,O,O sequences do happen (audio/video muxing problems?) and
151
+    small packets (<8 bytes) happen frequently; the ability to skip is useful.
152
+
153
+  note 1:
154
+    As the returned cluster will not have more ranges than the passed one, what is
155
+    described above is not actually true.
156
+    In the step 1) the returned cluster will cover 8-9, but in step 2) it will
157
+    cover 11-17 (some extra packets had to remain in); this lack of information
158
+    prevents us from using an optimal 11-11,13-13,15-22 in step 3). Note that
159
+    in any case step 3) will decrypt 11,13,15,16,17,18,19,20 thanks to the
160
+    extra margin we use (we put ten packets (including 19 and 20) even if the
161
+    parallelism was just 8, and it was a good idea; but if 19 and 20 were of
162
+    type c, we would have run the decryption with only 6/8 efficiency).
163
+    This problem can be prevented by using ranges with only one packet: in
164
+    step 2) we would have passed
165
+    8-8,9-9,10-10,11-11,12-12,13-13,14-14,15-15,16-16,17-17
166
+    and got back
167
+    11-11,13-13,15-17.
168
+
169
+
170
+------------
171
+API EXAMPLES
172
+------------
173
+
174
+Some examples of how the API can be used (this is not real code, so it
175
+may have typos or other bugs).
176
+
177
+
178
+Example 1: (big linear buffer, simple use of cluster)
179
+
180
+  unsigned char *p;
181
+  unsigned char *cluster[3];
182
+  for(p=start;p<end;){
183
+    cluster[0]=p;cluster[1]=end;
184
+    cluster[2]=NULL;
185
+    p+=188*decrypt_packets(cluster);
186
+  }
187
+  //consume(start,end);
188
+
189
+
190
+Example 2: (circular buffer, simple use of cluster)
191
+
192
+  unsigned char *new_read;
193
+  unsigned char *cluster[5];
194
+
195
+  while(1){
196
+    if(read==write){
197
+      //buffer is empty
198
+      //write=refill_buffer(write,start,end);
199
+      continue;
200
+    }
201
+    else if(read<write){
202
+      cluster[0]=read;cluster[1]=write;
203
+      cluster[2]=NULL;
204
+    }
205
+    else{
206
+      cluster[0]=read;cluster[1]=end;
207
+      cluster[2]=start;cluster[3]=write;
208
+      cluster[4]=NULL;
209
+    }
210
+    new_read=read+188*decrypt_packets(cluster);
211
+    if(new_read<=end){
212
+      //consume(read,new_read);
213
+    }
214
+    else{
215
+      new_read=start+(new_read-end);
216
+      //consume(read,end);
217
+      //consume(start,new_read);
218
+    }
219
+    read=new_read;
220
+    if(read==end) read=start;
221
+  }
222
+
223
+
224
+Example 3: (undefined buffer structure, advanced use of cluster)
225
+
226
+  unsigned char *packets[1000000];
227
+  unsigned char *cluster[142]; //if suggested packets is 70
228
+  
229
+  cluster[0]=NULL;
230
+  for(n=0;n<1000000;){
231
+    i=0;
232
+    while(cluster[2*i]!=NULL) i++; //preserve returned ranges
233
+    for(k=i;k<70&&n<1000000;k++,n++){
234
+      cluster[2*k]=packets[n];cluster[2*k+1]=packets[n]+188;
235
+    }
236
+    cluster[2*k]=NULL;
237
+    decrypt_packets(cluster);
238
+  }
239
+  //consume_all_packets();

FFdecsa/docs/technical_background.txt (+341 -0)

@@ -0,0 +1,341 @@
1
+-------
2
+FFdecsa
3
+-------
4
+
5
+This doc is for people who looked into the source code and found it
6
+difficult to believe that this is a decsa algorithm, as it appears
7
+completely different from other decsa implementations.
8
+
9
+It appears different because it is different. Being different is what
10
+enables it to be a lot faster than all the others (currently it has more
11
+than 800% the speed of the best version I was able to find).
12
+
13
+The csa algo was designed to be run in hardware, but people are now
14
+running it in software.
15
+
16
+Hardware has data lines carrying bits and functional blocks doing
17
+calculations (logic operations, adders, shifters, table lookup, ...),
18
+software instead uses memory to contain data values and executes a
19
+sequence of instructions to transform the values. As a consequence,
20
+writing a software implementation of a hardware algorithm can be
21
+inefficient.
22
+
23
+For example, if you have 32 data lines, you can permute the bits with
24
+zero cost in hardware (you just permute the physical traces), but if you
25
+have the bits in a 32 bit variable you have to use 32 "and" operations
26
+with 32 different masks, 32 shifts and 31 "or" operations (if you
27
+suggest using "if"s testing the bits one by one you know nothing about
28
+how branch prediction works in modern processors).
29
+
30
+So the approach is *emulating the hardware*.
31
+
32
+Then there are some additional cool tricks.
33
+
34
+TRICK NUMBER 0: emulate the hardware
35
+------------------------------------
36
+We will work on bits one by one; that is, a 4 bit word is now four
37
+variables. In this way we turn complex software operations into
38
+hardware emulation:
39
+
40
+  software                      hardware
41
+  -------------------------------------------
42
+  copy values                   copy values
43
+  logic op                      logic op
44
+  (bit permut.) ands+shifts+ors copy values
45
+  additions                     logic op emulating adders
46
+  (comparisons) if              logic op selecting one of the two results
47
+  lookup tables                 logic op synthesizing a ROM (*)
48
+
49
+(*) sometimes lookup tables can be converted to logic expressions
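As an illustration of the "additions -> logic op emulating adders" row, here is a minimal sketch of a 4 bit adder working on 32 independent values at once. The names (wire, add4_sliced, put_lane, get_lane) and the choice of index 0 as the least significant bit are assumptions made for this example; they are not taken from the FFdecsa source.

```c
#include <assert.h>
#include <stdint.h>

/* One "wire" carries the same bit position of 32 independent values. */
typedef uint32_t wire;

/* Bitsliced 4 bit adder (result modulo 16) for all 32 lanes at once,
 * built only from xor/and/or like a hardware ripple-carry adder.
 * Index 0 is the least significant bit (a choice made for this sketch). */
static void add4_sliced(const wire a[4], const wire b[4], wire sum[4])
{
  wire carry = 0;
  int k;
  for (k = 0; k < 4; k++) {
    wire ak = a[k], bk = b[k];
    wire axb = ak ^ bk;
    sum[k] = axb ^ carry;
    carry = (ak & bk) | (carry & axb);
  }
}

/* Store the 4 bit value v into lane `lane` (0..31) of w. */
static void put_lane(wire w[4], int lane, unsigned v)
{
  int k;
  assert(lane >= 0 && lane < 32 && v < 16);
  for (k = 0; k < 4; k++) {
    w[k] &= ~((wire)1 << lane);
    w[k] |= (wire)((v >> k) & 1u) << lane;
  }
}

/* Read the 4 bit value held in lane `lane` of w. */
static unsigned get_lane(const wire w[4], int lane)
{
  unsigned v = 0;
  int k;
  assert(lane >= 0 && lane < 32);
  for (k = 0; k < 4; k++) v |= (unsigned)((w[k] >> lane) & 1u) << k;
  return v;
}
```

The adder itself costs a handful of logic operations regardless of how many lanes are in use; put_lane and get_lane are the slow conversion steps that the text goes on to discuss.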
50
+
51
+The sboxes in the stream cypher have been converted to efficient logic
+operations using custom-written software (see the logic directory);
+this is responsible for much of the speed increase. Maybe there exists
+a slightly better way to express the sboxes as logical expressions, but
+it would be a minuscule improvement. The sbox in the block cypher can't
+be converted to efficient logic operations (8 bits of input are just
+too much) and is implemented with a traditional lookup in an array.
58
+
59
+But there is a problem: we want to process bits, but our external
+input and output use bytes. We need conversion routines. Conversion
+routines are similar to the awful permutations we described before, so
+this has to be done efficiently somehow.
63
+
64
+
65
+TRICK NUMBER 1: virtual shift registers
66
+---------------------------------------
67
+Shift registers are normally implemented by moving all data around.
68
+Better leave the data in the same memory locations and redefine where
69
+the start of the register is (updating a pointer). That is called a
70
+virtual shift register.
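A minimal sketch of the idea, assuming a fixed-length register addressed through a moving index. The names vreg, vreg_get and vreg_shift_in and the length of 10 cells are hypothetical; the real code may organize its storage differently.

```c
#include <assert.h>

#define REG_LEN 10 /* number of cells, an arbitrary length for this sketch */

/* A shift register whose cells never move: shifting only changes `head`,
 * the index of the cell currently regarded as position 0. */
struct vreg {
  unsigned char cell[REG_LEN];
  int head;
};

/* Read logical position pos (0 = newest entry). */
static unsigned char vreg_get(const struct vreg *r, int pos)
{
  assert(pos >= 0 && pos < REG_LEN);
  return r->cell[(r->head + pos) % REG_LEN];
}

/* Move every entry one position further and insert v at position 0.
 * The oldest entry (position REG_LEN-1) is overwritten. */
static void vreg_shift_in(struct vreg *r, unsigned char v)
{
  r->head = (r->head + REG_LEN - 1) % REG_LEN;
  r->cell[r->head] = v;
}
```

A shift costs one index update and one store, instead of copying REG_LEN-1 cells.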
71
+
72
+
73
+TRICK NUMBER 2: parallel bitslice
74
+---------------------------------
75
+Implementing the algorithm as described in tricks 0 and 1 gives us about
76
+15% of the speed of a traditional implementation. This happens because
77
+we work on only one bit, even if our CPU is 32 bit wide. But *we can
78
+process 32 different packets at the same time*. This is called the
79
+"bitslice" method. It can be done only if the program flow is not
80
+dependent on the data (if, while,...). Luckily this is true.
81
+Things like
82
+  if(a){
83
+    b=c&d;
84
+  }
85
+  else{
86
+    b=e&f;
87
+  }
88
+can be coded as (think of how hardware would implement this)
89
+  b1=c&d;
90
+  b2=e&f;
91
+  b=b2^(a&(b1^b2));
92
+and things like
93
+  if(a){
94
+    b=c&d;
95
+  }
96
+can be transformed in the same way, as they may be written as
97
+  if(a){
98
+    b=c&d;
99
+  }
100
+  else{
101
+    b=b;
102
+  }
103
+It could look wasteful, but it is not, and it removes the data dependency.
104
+
105
+Our code takes the same time as before, but produces 32 results, so
106
+speed is now 480% the speed of a traditional implementation.
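The branch-free select shown above can be checked against the per-packet if/else it replaces. The sketch below uses a 32 bit word as 32 independent lanes; select32, select32_reference and select32_check are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free select on 32 lanes at once: every result bit comes from
 * b1 where the matching bit of a is 1, and from b2 where it is 0. */
static uint32_t select32(uint32_t a, uint32_t b1, uint32_t b2)
{
  return b2 ^ (a & (b1 ^ b2));
}

/* Reference version: the original if/else applied to one lane at a time. */
static uint32_t select32_reference(uint32_t a, uint32_t b1, uint32_t b2)
{
  uint32_t out = 0;
  int lane;
  for (lane = 0; lane < 32; lane++) {
    uint32_t bit;
    if ((a >> lane) & 1u) bit = (b1 >> lane) & 1u;
    else                  bit = (b2 >> lane) & 1u;
    out |= bit << lane;
  }
  return out;
}

/* Cross-check both versions on the expression from the text:
 * if(a) b=c&d; else b=e&f; */
static void select32_check(uint32_t a, uint32_t c, uint32_t d,
                           uint32_t e, uint32_t f)
{
  assert(select32(a, c & d, e & f) == select32_reference(a, c & d, e & f));
}
```

Both branches are always computed, but the cost is three logic operations for 32 packets and the instruction stream no longer depends on the data.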
107
+
108
+
109
+TRICK NUMBER 3: multimedia instructions
110
+---------------------------------------
111
+If our CPU is 32 bit but it can also process larger blocks of data
112
+efficiently (multimedia instructions), we can use them. We only need
113
+logic ops and these are typically available.
114
+
115
+We can use MMX and work on 64 packets, or SSE and work on 128 packets.
116
+The speed doesn't automatically double going from 32 to 64 because the
117
+integer registers of the processor are normally faster. However, some
118
+speed is gained in this way.
119
+
120
+Multimedia instructions are often used by writing assembler by hand, but
121
+compilers are very good at doing register allocation, loop unrolling and
122
+instruction scheduling, so it is better to write the code in C and use
123
+native multimedia data types (intrinsics).
124
+
125
+Depending on number of available registers, execution latency, number of
126
+execution units in the CPU, it may be good to process more than one data
127
+block at the same time, for example 2 64bit MMX values. In this case we
128
+work on 128 bits by simulating a 128 bit op with two consecutive 64 bit
129
+ops. This may or may not help (apparently not, because the x86 architecture
130
+has a small number of registers).
131
+
132
+We can also try working on 96 bit, pairing a MMX and an int op, or 192
133
+bit by using MMX and SSE. While this is doable in theory and could
134
+exploit different execution units in the CPU, speed doesn't improve
135
+(because of cache line handling problems inside the CPU, maybe).
136
+
137
+Besides int, MMX, SSE, we can use long long int (64 bit) and, why not,
138
+unsigned char.
139
+
140
+Using groups of unsigned chars (8 or 16) could give the compiler an
141
+opportunity to insert multimedia instructions automatically. For
142
+example, icc can use one MMX instruction to do
143
+  unsigned char a[8],b[8],c[8];
144
+  for(i=0;i<8;i++){
145
+    a[i]=b[i]&c[i];
146
+  }
147
+Some compilers (like icc) are efficient in this case, but using
148
+intrinsics manually is generally faster.
149
+
150
+All these experiments can be easily done if the code is written in a way
151
+which abstracts the data type used. This is not easy but doable: all the
152
+operations on data become (inlined) function calls or preprocessor
153
+macros. Good compilers are able to simplify all the abstraction at
154
+compile time and generate perfect code (gcc is great).
155
+
156
+The data abstraction used in the code is called "group".
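To give an idea of what such an abstraction can look like, here is a small sketch in which the word type is chosen at compile time and all logic goes through inline wrappers. The switch GROUP_USE_64 and the group_* names are hypothetical and do not reproduce the macros used in the actual sources.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the word type at compile time; GROUP_USE_64 is a hypothetical switch. */
#ifdef GROUP_USE_64
typedef uint64_t group;          /* 64 packets per logic operation */
#define GROUP_PARALLELISM 64
#else
typedef uint32_t group;          /* 32 packets per logic operation */
#define GROUP_PARALLELISM 32
#endif

/* All logic goes through these wrappers, so the algorithm never names
 * the underlying type. With MMX or SSE the bodies would use intrinsics. */
static inline group group_zero(void)            { return (group)0; }
static inline group group_ones(void)            { return ~(group)0; }
static inline group group_and(group a, group b) { return a & b; }
static inline group group_or(group a, group b)  { return a | b; }
static inline group group_xor(group a, group b) { return a ^ b; }
static inline group group_not(group a)          { return ~a; }

/* Code written only against the abstraction: a bitsliced
 * majority-of-three function. */
static inline group group_majority(group a, group b, group c)
{
  return group_or(group_and(a, b), group_and(c, group_xor(a, b)));
}

/* Sanity checks that must hold for any choice of group type. */
static inline void group_check_identities(group a)
{
  assert(group_xor(a, a) == group_zero());
  assert(group_and(a, group_not(a)) == group_zero());
}
```

Switching the whole algorithm to another width then means changing one typedef and the wrapper bodies, which is what makes the experiments described above cheap to run.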
157
+
158
+
159
+TRICK NUMBER 4: parallel byteslice
160
+----------------------------------
161
+The bitslice method works wonderfully on the stream cypher, but can't be
162
+applied to the block cypher because of the evil big look up table.
163
+
164
+As we have to convert input data from normal to bitslice before starting
165
+processing and from bitslice to normal before output, we convert the
166
+stream cypher output to normal before the block calculations and do the
167
+block stage in a traditional way.
168
+
169
+There are some xors in the block cypher; so we arrange bytes from
170
+different packets side by side and use multimedia instructions to work
171
+on many bytes at the same time. This is not exactly bitslice, maybe it
172
+is called byteslice. The conversion routines are similar (just a bit
173
+simpler).
174
+
175
+The data type we use to do this in the code is called "batch".
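A rough sketch of the byteslice idea with an 8 byte batch held in a 64 bit integer. The names batch_gather and batch_xor and the batch width are assumptions for this example; the real code selects the batch type per grouping mode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BATCH_BYTES 8 /* one byte from each of 8 packets; width chosen for the sketch */

/* Collect the byte at `offset` from 8 different packets side by side. */
static void batch_gather(unsigned char dst[BATCH_BYTES],
                         unsigned char *const packets[BATCH_BYTES], int offset)
{
  int k;
  assert(offset >= 0 && offset < 188);
  for (k = 0; k < BATCH_BYTES; k++) dst[k] = packets[k][offset];
}

/* XOR 8 bytes with 8 other bytes in one 64 bit operation.
 * memcpy avoids aliasing and alignment assumptions. */
static void batch_xor(unsigned char dst[BATCH_BYTES],
                      const unsigned char a[BATCH_BYTES],
                      const unsigned char b[BATCH_BYTES])
{
  uint64_t wa, wb;
  memcpy(&wa, a, sizeof wa);
  memcpy(&wb, b, sizeof wb);
  wa ^= wb;
  memcpy(dst, &wa, sizeof wa);
}
```

Because xor acts on each bit independently, the result does not depend on the byte order of the machine.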
176
+
177
+The virtual shift register described in trick number 1 is useful too.
178
+
179
+The look up table is the only thing which is done serially one byte at a
180
+time. Luckily if we do it on 32 or 64 bytes the loop is heavily
181
+unrolled, and the compiler and the CPU manage to get a good speed
182
+because there is little dependency between instructions.
183
+
184
+
185
+TRICK NUMBER 5: efficient bit permutation
186
+-----------------------------------------
187
+The block cypher has a bit permutation part. As we are not in a bit
188
+sliced form at that point, permuting bits in a byte takes 8 masks, 8
189
+and, 7 or; but three bits move in the same direction, so we make it with
190
+6 masks, 6 and, 5 or. Batch processing through multimedia instructions
191
+is applicable too.
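The saving can be illustrated with a made-up permutation in which three bits move by the same distance. The table perm_dest below is NOT the CSA permutation; it is a hypothetical one chosen so that the grouped version needs exactly 6 masks, 6 ands and 5 ors. Bit 0 is the least significant bit in this sketch.

```c
#include <assert.h>

/* Hypothetical permutation used only for illustration (bit 0 = LSB):
 * source bit i moves to position perm_dest[i]. */
static const int perm_dest[8] = { 3, 4, 5, 0, 2, 7, 1, 6 };

/* Straightforward version: every bit is extracted and placed separately. */
static unsigned char permute_bitwise(unsigned char x)
{
  unsigned char y = 0;
  int i;
  for (i = 0; i < 8; i++) {
    y |= (unsigned char)(((x >> i) & 1u) << perm_dest[i]);
  }
  return y;
}

/* Grouped version: bits 0, 1 and 2 all move left by 3, so a single
 * mask (0x07) handles them together. */
static unsigned char permute_grouped(unsigned char x)
{
  return (unsigned char)(((x & 0x07) << 3) |
                         ((x & 0x08) >> 3) |
                         ((x & 0x10) >> 2) |
                         ((x & 0x20) << 2) |
                         ((x & 0x40) >> 5) |
                         ((x & 0x80) >> 1));
}

/* Exhaustive check that both versions agree on every byte value. */
static void permute_check_all(void)
{
  int v;
  for (v = 0; v < 256; v++) {
    assert(permute_bitwise((unsigned char)v) == permute_grouped((unsigned char)v));
  }
}
```

The grouped expression uses only masks and shifts, so it works unchanged on a wider word holding several bytes, as long as no shifted bit crosses a byte boundary (the masks guarantee that here).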
192
+
193
+
194
+TRICK NUMBER 6: efficient normal<->slice conversion
195
+---------------------------------------------------
196
+The bitslice<->normal conversion routines are a sort of transposition
197
+operation, that is you have bits in rows and want them in columns. This
198
+can be done efficiently. For example, transposition of 8 bytes (matrix
199
+of 8x8=64 bits) can be done this way (we want to exchange bit[i][j] with
200
+bit[j][i] and we assume bit 0 is the MSB in the byte):
201
+
202
+  // untested code, may be bugged
203
+  unsigned char a[8];
204
+  unsigned char b[8];
205
+  for(i=0;i<8;i++) b[i]=0;
206
+  for(i=0;i<8;i++){
207
+    for(j=0;j<8;j++){
208
+      b[i]|=((a[j]>>(7-i)&1))<<(7-j);
209
+    }
210
+  }
211
+
212
+but it is slow (128 shifts, 64 and, 64 or), or
213
+
214
+  // untested code, may be bugged
215
+  unsigned char a[8];
216
+  unsigned char b[8];
217
+  for(i=0;i<8;i++) b[i]=0;
218
+  for(i=0;i<8;i++){
219
+    for(j=0;j<8;j++){
220
+      if(a[j]&(1<<(7-i))) b[i]|=1<<(7-j);
221
+    }
222
+  }
223
+
224
+but it is very very slow (128 shifts, 64 and, 64 or, 64 unpredictable
225
+if!), or using a>>=1 and b<<=1, which gains you nothing, or
226
+
227
+  // untested code, may be bugged
228
+  unsigned char a[8];
229
+  unsigned char b[8];
230
+  unsigned char top,bottom;
231
+  for(j=0;j<1;j++){
232
+    for(i=0;i<4;i++){
233
+      top=   a[8*j+i];
234
+      bottom=a[8*j+4+i];
235
+      a[8*j+i]=   (top&0xf0)    |((bottom&0xf0)>>4);
236
+      a[8*j+4+i]=((top&0x0f)<<4)| (bottom&0x0f);
237
+    }
238
+  }
239
+  for(j=0;j<2;j++){
240
+    for(i=0;i<2;i++){
241
+      top=   a[4*j+i];
242
+      bottom=a[4*j+2+i];
243
+      a[4*j+i]  = (top&0xcc)    |((bottom&0xcc)>>2);
244
+      a[4*j+2+i]=((top&0x33)<<2)| (bottom&0x33);
245
+    }
246
+  }
247
+  for(j=0;j<4;j++){
248
+    for(i=0;i<1;i++){
249
+      top=   a[2*j+i];
250
+      bottom=a[2*j+1+i];
251
+      a[2*j+i]  = (top&0xaa)    |((bottom&0xaa)>>1);
252
+      a[2*j+1+i]=((top&0x55)<<1)| (bottom&0x55);
253
+    }
254
+  }
255
+  for(i=0;i<8;i++) b[i]=a[i]; //easy to integrate into one of the stages above
256
+
257
+which is very fast (24 shifts, 48 and, 24 or) and has redundant loops
258
+and address calculations which will be optimized away by the compiler.
259
+It can be written as 3 nested loops but it becomes less readable and
260
+makes it difficult to have results in b without an extra copy. The
261
+compiler always unrolls heavily.
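Since the listings above are marked as untested, here is a compact way to check the staged method against the naive one. The three stages are folded into one helper; swap_stage, transpose8_naive and transpose8_staged are names chosen for this sketch, and bit 0 is the MSB as in the text.

```c
#include <assert.h>
#include <string.h>

/* Reference: bit (row j, column i) goes to (row i, column j). */
static void transpose8_naive(const unsigned char a[8], unsigned char b[8])
{
  int i, j;
  memset(b, 0, 8);
  for (i = 0; i < 8; i++)
    for (j = 0; j < 8; j++)
      b[i] |= (unsigned char)(((a[j] >> (7 - i)) & 1) << (7 - j));
}

/* One stage: swap the off-diagonal blocks of every (2*half)x(2*half)
 * square. keep selects the columns that stay in the upper rows. */
static void swap_stage(unsigned char m[8], int half, unsigned char keep)
{
  unsigned char move = (unsigned char)~keep;
  int base, i;
  assert(half == 1 || half == 2 || half == 4);
  for (base = 0; base < 8; base += 2 * half) {
    for (i = 0; i < half; i++) {
      unsigned char top = m[base + i];
      unsigned char bottom = m[base + half + i];
      m[base + i] = (unsigned char)((top & keep) | ((bottom & keep) >> half));
      m[base + half + i] = (unsigned char)(((top & move) << half) | (bottom & move));
    }
  }
}

/* Three stages with block sizes 4, 2 and 1, as in the listing above. */
static void transpose8_staged(const unsigned char a[8], unsigned char b[8])
{
  memcpy(b, a, 8);
  swap_stage(b, 4, 0xf0);
  swap_stage(b, 2, 0xcc);
  swap_stage(b, 1, 0xaa);
}
```

Transposing twice must give back the input, which is a cheap extra check when the masks or shifts are adapted to wider words.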
262
+
263
+The gain is much bigger when operating with 32 bit or 64 bit values (we
264
+are going from N^2 to Nlog(N)). This method is used for rectangular
265
+matrixes too (they have to be seen as square matrixes side by side).
266
+Warning: this code is not *endian independent* if you use ints to work
267
+on 4 bytes. Running it on a big endian processor will give you a
268
+different and strange kind of bit rotation if you don't modify masks and
269
+shifts.
270
+
271
+This is done in the code using int or long long int. It should be
272
+possible to use MMX instead of long long int and it could be faster, but
273
+this code doesn't cost a great fraction of the total time. There are
274
+problems with the shifts, as multimedia instructions do not have all
275
+possible kinds of shift we need (SSE has none!).
276
+
277
+
278
+TRICK NUMBER 7: try hard to process packets together
279
+----------------------------------------------------
280
+As we are able to process many packets together, we have to avoid
281
+running with many slots empty. Processing one packet or 64 packets takes
282
+the same time if the internal parallelism is 64! So we try hard to
283
+aggregate packets that can be processed together; for simplicity reasons
284
+we don't mix packets with even and odd parity (different keys), even if
285
+it should be doable with a little effort. Sometimes the transition from
286
+even to odd parity and vice versa is not sharp, but there are sequences
287
+like EEEEEOEEOEEOOOO. We try to group all the E together even if there
288
+are O between them. This out-of-order processing complicates the
289
+interface to the applications a bit but saves us three or four runs with
290
+many empty slots.
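The grouping by parity can be sketched as follows. The parity is read from the transport_scrambling_control bits of the TS header (binary 10 for the even key, 11 for the odd key); the function names and the picked[] output array are assumptions made for this example and do not mirror the real implementation, which also handles payload sizes.

```c
#include <assert.h>

enum ts_parity { TS_CLEAR, TS_EVEN, TS_ODD };

/* Classify a 188-byte TS packet by its transport_scrambling_control
 * bits (top two bits of the fourth header byte). */
static enum ts_parity ts_parity_of(const unsigned char *pkt)
{
  switch (pkt[3] & 0xc0) {
    case 0x80: return TS_EVEN;
    case 0xc0: return TS_ODD;
    default:   return TS_CLEAR; /* 00 = not scrambled, 01 = reserved */
  }
}

/* Fill up to max_slots entries of picked[] with the indices of packets
 * that share the parity of the first encrypted packet, skipping the
 * others. Returns the number of slots used. */
static int pick_same_parity(unsigned char *const *packets, int n,
                            int max_slots, int *picked)
{
  enum ts_parity want = TS_CLEAR;
  int used = 0;
  int i;
  assert(n >= 0 && max_slots >= 0);
  for (i = 0; i < n && used < max_slots; i++) {
    enum ts_parity p = ts_parity_of(packets[i]);
    if (p == TS_CLEAR) continue;     /* nothing to decrypt */
    if (want == TS_CLEAR) want = p;  /* first encrypted packet decides */
    if (p == want) picked[used++] = i;
  }
  return used;
}
```

For a sequence E E O E c O the sketch picks packets 0, 1 and 3 in one run; the odd packets are left for a later call.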
291
+
292
+We have also logic to process together packets with a different size of
293
+the payload, which is not always 184 bytes. This involves sorting the
294
+packets by size before processing and careful operation of the 23
295
+iteration loop to exclude some packets from the calculations. It is not
296
+CPU heavy.
297
+
298
+Packets with payload <8 bytes are identical before and after decryption
+(!), so we skip them without using a slot. (According to DVB specs this
+kind of packet shouldn't happen, but it is used in the real world.)
301
+
302
+
303
+TRICK NUMBER 8: try to avoid doing the same thing many times
304
+------------------------------------------------------------
305
+Some calculations related to keys are only done when the keys are set,
306
+then all the values depending on keys are stored in a convenient form
307
+and used every time we convert a group of packets.
308
+
309
+
310
+TRICK NUMBER 9: compiler
311
+------------------------
312
+
313
+Compilers have a lot of optimization options. I used -march to target my
314
+CPU and played with unusual options. In particular
315
+  "--param max-unrolled-insns=500"
316
+does a good job on the tricky table lookup in the block cypher. Bigger
317
+values unroll too much somewhere and lose speed. All the testing has
318
+been done on an AthlonXP CPU with a specific version of gcc
319
+  gcc version 3.3.3 20040412 (Red Hat Linux 3.3.3-7)
320
+Other combinations of CPU and compiler can give different speeds. If the
321
+compiler is not able to simplify the group and batch structures and
322
+stores everything in memory instead of registers, performance will be
323
+low.
324
+
325
+Absolutely use a good compiler!
326
+
327
+Note: the same code can be compiled in C or C++ mode. g++ gives a 3%
328
+speed increase compared to gcc (I suppose some stricter constraint on
329
+array and pointers in C++ mode gives the optimizer more freedom).
330
+
331
+
332
+TRICK NUMBER a: a lot of brain work
333
+-----------------------------------
334
+The code started as a very slow but correct implementation and was then
335
+tweaked for months with a lot of experimentation and by adding all the
336
+good ideas one after another to achieve little steps toward the best
337
+speed possible, while continuously testing that nothing had been broken.
338
+
339
+Many hours were spent on this code.
340
+
341
+Enjoy the result.

FFdecsa/fftable.h (+56 -0)

@@ -0,0 +1,56 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2007 Dark Avenger
4
+ *               2003-2004  fatih89r
5
+ *
6
+ * This program is free software; you can redistribute it and/or modify
7
+ * it under the terms of the GNU General Public License as published by
8
+ * the Free Software Foundation; either version 2 of the License, or
9
+ * (at your option) any later version.
10
+ *
11
+ * This program is distributed in the hope that it will be useful,
12
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
14
+ * GNU General Public License for more details.
15
+ *
16
+ * You should have received a copy of the GNU General Public License
17
+ * along with this program; if not, write to the Free Software
18
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
19
+ */
20
+
21
+#ifndef FFTABLE_H
22
+#define FFTABLE_H
23
+
24
+static inline void FFTABLEIN(unsigned char *tab, int g, unsigned char *data)
25
+{
26
+#if 0
27
+  *(((int *)tab)+2*g)=*((int *)data);
28
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
29
+#else
30
+  *(((long long *)tab)+g)=*((long long *)data);
31
+#endif
32
+}
33
+
34
+static inline void FFTABLEOUT(unsigned char *data, unsigned char *tab, int g)
35
+{
36
+#if 1
37
+  *((int *)data)=*(((int *)tab)+2*g);
38
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
39
+#else
40
+  *((long long *)data)=*(((long long *)tab)+g);
41
+#endif
42
+}
43
+
44
+static inline void FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g)
45
+{
46
+  int j;
47
+  for(j=0;j<n;j++) *(data+j)^=*(tab+8*g+j);
48
+}
49
+
50
+#undef XOREQ_BEST_BY
51
+static inline void XOREQ_BEST_BY(unsigned char *d, unsigned char *s)
52
+{
53
+	XOR_BEST_BY(d, d, s);
54
+}
55
+
56
+#endif //FFTABLE_H 

FFdecsa/logic/Makefile (+10 -0)

@@ -0,0 +1,10 @@
1
+all: logic
2
+
3
+logic: logic.o
4
+	gcc -o logic logic.o
5
+
6
+logic.o: logic.c
7
+	gcc -O3 -march=athlon-xp -c logic.c
8
+
9
+clean:
10
+	rm logic *.o

FFdecsa/logic/logic.c (+330 -0)

@@ -0,0 +1,330 @@
1
+/* logic -- synthesize logic functions with 4 inputs
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+
22
+
23
+/* Can we use negated inputs? */
24
+#define noNEGATEDTOO
25
+
26
+
27
+#include <stdio.h>
28
+
29
+
30
+/*
31
+ * abcd
32
+ */
33
+
34
+#define BINARY(b15,b14,b13,b12,b11,b10,b9,b8,b7,b6,b5,b4,b3,b2,b1,b0) \
35
+  ((b15)<<15)|((b14)<<14)|((b13)<<13)|((b12)<<12)| \
36
+  ((b11)<<11)|((b10)<<10)|((b9) << 9)|((b8) << 8)| \
37
+  ((b7) << 7)|((b6) << 6)|((b5) << 5)|((b4) << 4)| \
38
+  ((b3) << 3)|((b2) << 2)|((b1) << 1)|((b0) << 0)
39
+
40
+struct fun{
41
+  int level;
42
+  int op_type;
43
+  int op1;
44
+  int op2;
45
+};
46
+
47
+struct fun db[65536];
48
+int n_fun;
49
+
50
+#define LEVEL_ALOT 1000000
51
+
52
+#define OP_FALSE 0
53
+#define OP_TRUE  1
54
+#define OP_SRC   2
55
+#define OP_AND   3
56
+#define OP_OR    4
57
+#define OP_XOR   5
58
+
59
+#define SRC_A 10
60
+#define SRC_B 20
61
+#define SRC_C 30
62
+#define SRC_D 40
63
+#define SRC_AN 11
64
+#define SRC_BN 21
65
+#define SRC_CN 31
66
+#define SRC_DN 41
67
+
68
+void dump_element_prefix(int);
69
+void dump_element_infix(int);
70
+
71
+int main(void){
72
+  int i,j;
73
+  int l,p1,p2;
74
+  int candidate;
75
+  int max_p2_lev;
76
+  
77
+  for(i=0;i<65536;i++){
78
+    db[i].level=LEVEL_ALOT;
79
+  }
80
+  n_fun=0;
81
+
82
+  db[0].level=0;
83
+  db[0].op_type=OP_FALSE;
84
+  n_fun++;
85
+
86
+  db[65535].level=0;
87
+  db[65535].op_type=OP_TRUE;
88
+  n_fun++;
89
+
90
+  db[BINARY(0,0,0,0, 0,0,0,0,  1,1,1,1, 1,1,1,1)].level=0;
91
+  db[BINARY(0,0,0,0, 0,0,0,0,  1,1,1,1, 1,1,1,1)].op_type=OP_SRC;
92
+  db[BINARY(0,0,0,0, 0,0,0,0,  1,1,1,1, 1,1,1,1)].op1=SRC_A;
93
+  n_fun++;
94
+
95
+  db[BINARY(0,0,0,0, 1,1,1,1,  0,0,0,0, 1,1,1,1)].level=0;
96
+  db[BINARY(0,0,0,0, 1,1,1,1,  0,0,0,0, 1,1,1,1)].op_type=OP_SRC;
97
+  db[BINARY(0,0,0,0, 1,1,1,1,  0,0,0,0, 1,1,1,1)].op1=SRC_B;
98
+  n_fun++;
99
+
100
+  db[BINARY(0,0,1,1, 0,0,1,1,  0,0,1,1, 0,0,1,1)].level=0;
101
+  db[BINARY(0,0,1,1, 0,0,1,1,  0,0,1,1, 0,0,1,1)].op_type=OP_SRC;
102
+  db[BINARY(0,0,1,1, 0,0,1,1,  0,0,1,1, 0,0,1,1)].op1=SRC_C;
103
+  n_fun++;
104
+
105
+  db[BINARY(0,1,0,1, 0,1,0,1,  0,1,0,1, 0,1,0,1)].level=0;
106
+  db[BINARY(0,1,0,1, 0,1,0,1,  0,1,0,1, 0,1,0,1)].op_type=OP_SRC;
107
+  db[BINARY(0,1,0,1, 0,1,0,1,  0,1,0,1, 0,1,0,1)].op1=SRC_D;
108
+  n_fun++;
109
+#ifdef NEGATEDTOO
110
+  db[BINARY(1,1,1,1, 1,1,1,1,  0,0,0,0, 0,0,0,0)].level=0;
111
+  db[BINARY(1,1,1,1, 1,1,1,1,  0,0,0,0, 0,0,0,0)].op_type=OP_SRC;
112
+  db[BINARY(1,1,1,1, 1,1,1,1,  0,0,0,0, 0,0,0,0)].op1=SRC_AN;
113
+  n_fun++;
114
+
115
+  db[BINARY(1,1,1,1, 0,0,0,0,  1,1,1,1, 0,0,0,0)].level=0;
116
+  db[BINARY(1,1,1,1, 0,0,0,0,  1,1,1,1, 0,0,0,0)].op_type=OP_SRC;
117
+  db[BINARY(1,1,1,1, 0,0,0,0,  1,1,1,1, 0,0,0,0)].op1=SRC_BN;
118
+  n_fun++;
119
+
120
+  db[BINARY(1,1,0,0, 1,1,0,0,  1,1,0,0, 1,1,0,0)].level=0;
121
+  db[BINARY(1,1,0,0, 1,1,0,0,  1,1,0,0, 1,1,0,0)].op_type=OP_SRC;
122
+  db[BINARY(1,1,0,0, 1,1,0,0,  1,1,0,0, 1,1,0,0)].op1=SRC_CN;
123
+  n_fun++;
124
+
125
+  db[BINARY(1,0,1,0, 1,0,1,0,  1,0,1,0, 1,0,1,0)].level=0;
126
+  db[BINARY(1,0,1,0, 1,0,1,0,  1,0,1,0, 1,0,1,0)].op_type=OP_SRC;
127
+  db[BINARY(1,0,1,0, 1,0,1,0,  1,0,1,0, 1,0,1,0)].op1=SRC_DN;
128
+  n_fun++;
129
+#endif
130
+
131
+  for(l=0;l<100;l++){
132
+    printf("calculating level %i\n",l);
133
+    for(p1=1;p1<65536;p1++){
134
+      if(db[p1].level==LEVEL_ALOT) continue;
135
+      max_p2_lev=l-db[p1].level-1;
136
+      for(p2=p1+1;p2<65536;p2++){
137
+        if(db[p2].level>max_p2_lev) continue;
138
+
139
+        candidate=p1&p2;
140
+        if(db[candidate].level==LEVEL_ALOT){
141
+          //found new
142
+          db[candidate].level=db[p1].level+db[p2].level+1;
143
+          db[candidate].op_type=OP_AND;
144
+          db[candidate].op1=p1;
145
+          db[candidate].op2=p2;
146
+          n_fun++;
147
+	}
148
+
149
+        candidate=p1|p2;
150
+        if(db[candidate].level==LEVEL_ALOT){
151
+          //found new
152
+          db[candidate].level=db[p1].level+db[p2].level+1;
153
+          db[candidate].op_type=OP_OR;
154
+          db[candidate].op1=p1;
155
+          db[candidate].op2=p2;
156
+          n_fun++;
157
+	}
158
+
159
+        candidate=p1^p2;
160
+        if(db[candidate].level==LEVEL_ALOT){
161
+          //found new
162
+          db[candidate].level=db[p1].level+db[p2].level+1;
163
+          db[candidate].op_type=OP_XOR;
164
+          db[candidate].op1=p1;
165
+          db[candidate].op2=p2;
166
+          n_fun++;
167
+	}
168
+
169
+      }
170
+    }
171
+    printf("num fun=%i\n\n",n_fun);
172
+    fflush(stdout);
173
+    if(n_fun>=65536) break;
174
+  }
175
+
176
+
177
+  for(i=0;i<65536;i++){
178
+    if(db[i].level==LEVEL_ALOT) continue;
179
+
180
+    printf("PREFIX ");
181
+    for(j=15;j>=0;j--){
182
+      printf("%i",i&(1<<j)?1:0);
183
+      if(j%4==0) printf(" ");
184
+      if(j%8==0) printf(" ");
185
+    }
186
+    printf(" : lev %2i: ",db[i].level);
187
+    dump_element_prefix(i);
188
+    printf("\n");
189
+
190
+    printf("INFIX  ");
191
+    for(j=15;j>=0;j--){
192
+      printf("%i",i&(1<<j)?1:0);
193
+      if(j%4==0) printf(" ");
194
+      if(j%8==0) printf(" ");
195
+    }
196
+    printf(" : lev %2i: ",db[i].level);
197
+    dump_element_infix(i);
198
+    printf("\n");
199
+  }
200
+  
201
+  return 0;
202
+}
203
+
204
+void dump_element_prefix(int e){
205
+  if(db[e].level==LEVEL_ALOT){
206
+    printf("PANIC!\n");
207
+    return;
208
+  }
209
+  switch(db[e].op_type){
210
+  case OP_FALSE:
211
+    printf("0");
212
+    break;
213
+  case OP_TRUE:
214
+    printf("1");
215
+    break;
216
+  case OP_SRC:
217
+    switch(db[e].op1){
218
+    case SRC_A:
219
+      printf("a");
220
+      break;
221
+    case SRC_B:
222
+      printf("b");
223
+      break;
224
+    case SRC_C:
225
+      printf("c");
226
+      break;
227
+    case SRC_D:
228
+      printf("d");
229
+      break;
230
+    case SRC_AN:
231
+      printf("an");
232
+      break;
233
+    case SRC_BN:
234
+      printf("bn");
235
+      break;
236
+    case SRC_CN:
237
+      printf("cn");
238
+      break;
239
+    case SRC_DN:
240
+      printf("dn");
241
+      break;
242
+    }
243
+    break;
244
+  case OP_AND:
245
+    printf("FFAND(");
246
+    dump_element_prefix(db[e].op1);
247
+    printf(",");
248
+    dump_element_prefix(db[e].op2);
249
+    printf(")");
250
+    break;
251
+  case OP_OR:
252
+    printf("FFOR(");
253
+    dump_element_prefix(db[e].op1);
254
+    printf(",");
255
+    dump_element_prefix(db[e].op2);
256
+    printf(")");
257
+    break;
258
+  case OP_XOR:
259
+    printf("FFXOR(");
260
+    dump_element_prefix(db[e].op1);
261
+    printf(",");
262
+    dump_element_prefix(db[e].op2);
263
+    printf(")");
264
+    break;
265
+  }
266
+}
267
+
268
+void dump_element_infix(int e){
269
+  if(db[e].level==LEVEL_ALOT){
270
+    printf("PANIC!\n");
271
+    return;
272
+  }
273
+  switch(db[e].op_type){
274
+  case OP_FALSE:
275
+    printf("0");
276
+    break;
277
+  case OP_TRUE:
278
+    printf("1");
279
+    break;
280
+  case OP_SRC:
281
+    switch(db[e].op1){
282
+    case SRC_A:
283
+      printf("a");
284
+      break;
285
+    case SRC_B:
286
+      printf("b");
287
+      break;
288
+    case SRC_C:
289
+      printf("c");
290
+      break;
291
+    case SRC_D:
292
+      printf("d");
293
+      break;
294
+    case SRC_AN:
295
+      printf("an");
296
+      break;
297
+    case SRC_BN:
298
+      printf("bn");
299
+      break;
300
+    case SRC_CN:
301
+      printf("cn");
302
+      break;
303
+    case SRC_DN:
304
+      printf("dn");
305
+      break;
306
+    }
307
+    break;
308
+  case OP_AND:
309
+    printf("( ");
310
+    dump_element_infix(db[e].op1);
311
+    printf("&");
312
+    dump_element_infix(db[e].op2);
313
+    printf(" )");
314
+    break;
315
+  case OP_OR:
316
+    printf("( ");
317
+    dump_element_infix(db[e].op1);
318
+    printf("|");
319
+    dump_element_infix(db[e].op2);
320
+    printf(" )");
321
+    break;
322
+  case OP_XOR:
323
+    printf("( ");
324
+    dump_element_infix(db[e].op1);
325
+    printf("^");
326
+    dump_element_infix(db[e].op2);
327
+    printf(" )");
328
+    break;
329
+  }
330
+}
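
Note on the encoding used by logic.c: each 4-input function is stored as a 16-bit truth table, so AND, OR and XOR of two functions are plain bitwise operations on their tables. The following is a minimal sketch of that idea, not part of the commit; the names `tt_eval` and `tt_example` are hypothetical, and only the four seed constants are taken from the `db[]` initialisation above.

```c
#include <assert.h>
#include <stdint.h>

/* Truth tables of the four inputs, same values as the db[] seeds in logic.c. */
enum { TT_A = 0x00FF, TT_B = 0x0F0F, TT_C = 0x3333, TT_D = 0x5555 };

/* Hypothetical helper: evaluate a 16-bit truth table for one input combination.
 * Column p = abcd (a is the most significant bit) lives in bit 15-p. */
static int tt_eval(uint16_t tt, int a, int b, int c, int d)
{
    assert(((a | b | c | d) & ~1) == 0);   /* every input must be 0 or 1 */
    int p = (a << 3) | (b << 2) | (c << 1) | d;
    return (tt >> (15 - p)) & 1;
}

/* Hypothetical example: the table of (a & b) ^ c, built from the seeds. */
static uint16_t tt_example(void)
{
    return (uint16_t)((TT_A & TT_B) ^ TT_C);
}
```

Because every function is just an index into `db[]`, the search in `main` can test whether a combination is new with a single array lookup.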

+ 206
- 0
FFdecsa/parallel_032_4char.h

@@ -0,0 +1,206 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned char s1,s2,s3,s4;
23
+};
24
+typedef struct group_t group;
25
+
26
+#define GROUP_PARALLELISM 32
27
+
28
+group static inline FF0(){
29
+  group res;
30
+  res.s1=0x0;
31
+  res.s2=0x0;
32
+  res.s3=0x0;
33
+  res.s4=0x0;
34
+  return res;
35
+}
36
+
37
+group static inline FF1(){
38
+  group res;
39
+  res.s1=0xff;
40
+  res.s2=0xff;
41
+  res.s3=0xff;
42
+  res.s4=0xff;
43
+  return res;
44
+}
45
+
46
+group static inline FFAND(group a,group b){
47
+  group res;
48
+  res.s1=a.s1&b.s1;
49
+  res.s2=a.s2&b.s2;
50
+  res.s3=a.s3&b.s3;
51
+  res.s4=a.s4&b.s4;
52
+  return res;
53
+}
54
+
55
+group static inline FFOR(group a,group b){
56
+  group res;
57
+  res.s1=a.s1|b.s1;
58
+  res.s2=a.s2|b.s2;
59
+  res.s3=a.s3|b.s3;
60
+  res.s4=a.s4|b.s4;
61
+  return res;
62
+}
63
+
64
+group static inline FFXOR(group a,group b){
65
+  group res;
66
+  res.s1=a.s1^b.s1;
67
+  res.s2=a.s2^b.s2;
68
+  res.s3=a.s3^b.s3;
69
+  res.s4=a.s4^b.s4;
70
+  return res;
71
+}
72
+
73
+group static inline FFNOT(group a){
74
+  group res;
75
+  res.s1=~a.s1;
76
+  res.s2=~a.s2;
77
+  res.s3=~a.s3;
78
+  res.s4=~a.s4;
79
+  return res;
80
+}
81
+
82
+
83
+/* tab is 64 rows of 32 bits: rows 0..31 hold bytes 0..3 of each block, rows 32..63 hold bytes 4..7 (assumes 4-byte int) */
84
+
85
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
86
+  *(((int *)tab)+g)=*((int *)data);
87
+  *(((int *)tab)+32+g)=*(((int *)data)+1);
88
+}
89
+
90
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
91
+  *((int *)data)=*(((int *)tab)+g);
92
+  *(((int *)data)+1)=*(((int *)tab)+32+g);
93
+}
94
+
95
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
96
+  int j;
97
+  for(j=0;j<n;j++){
98
+    *(data+j)^=*(tab+4*(g+(j>=4?32-1:0))+j);
99
+  }
100
+}
101
+
102
+struct batch_t{
103
+  unsigned char s1,s2,s3,s4;
104
+};
105
+typedef struct batch_t batch;
106
+
107
+#define BYTES_PER_BATCH 4
108
+
109
+batch static inline B_FFAND(batch a,batch b){
110
+  batch res;
111
+  res.s1=a.s1&b.s1;
112
+  res.s2=a.s2&b.s2;
113
+  res.s3=a.s3&b.s3;
114
+  res.s4=a.s4&b.s4;
115
+  return res;
116
+}
117
+
118
+batch static inline B_FFOR(batch a,batch b){
119
+  batch res;
120
+  res.s1=a.s1|b.s1;
121
+  res.s2=a.s2|b.s2;
122
+  res.s3=a.s3|b.s3;
123
+  res.s4=a.s4|b.s4;
124
+  return res;
125
+}
126
+
127
+batch static inline B_FFXOR(batch a,batch b){
128
+  batch res;
129
+  res.s1=a.s1^b.s1;
130
+  res.s2=a.s2^b.s2;
131
+  res.s3=a.s3^b.s3;
132
+  res.s4=a.s4^b.s4;
133
+  return res;
134
+}
135
+
136
+
137
+batch static inline B_FFN_ALL_29(){
138
+  batch res;
139
+  res.s1=0x29;
140
+  res.s2=0x29;
141
+  res.s3=0x29;
142
+  res.s4=0x29;
143
+  return res;
144
+}
145
+batch static inline B_FFN_ALL_02(){
146
+  batch res;
147
+  res.s1=0x02;
148
+  res.s2=0x02;
149
+  res.s3=0x02;
150
+  res.s4=0x02;
151
+  return res;
152
+}
153
+batch static inline B_FFN_ALL_04(){
154
+  batch res;
155
+  res.s1=0x04;
156
+  res.s2=0x04;
157
+  res.s3=0x04;
158
+  res.s4=0x04;
159
+  return res;
160
+}
161
+batch static inline B_FFN_ALL_10(){
162
+  batch res;
163
+  res.s1=0x10;
164
+  res.s2=0x10;
165
+  res.s3=0x10;
166
+  res.s4=0x10;
167
+  return res;
168
+}
169
+batch static inline B_FFN_ALL_40(){
170
+  batch res;
171
+  res.s1=0x40;
172
+  res.s2=0x40;
173
+  res.s3=0x40;
174
+  res.s4=0x40;
175
+  return res;
176
+}
177
+batch static inline B_FFN_ALL_80(){
178
+  batch res;
179
+  res.s1=0x80;
180
+  res.s2=0x80;
181
+  res.s3=0x80;
182
+  res.s4=0x80;
183
+  return res;
184
+}
185
+
186
+batch static inline B_FFSH8L(batch a,int n){
187
+  batch res;
188
+  res.s1=a.s1<<n;
189
+  res.s2=a.s2<<n;
190
+  res.s3=a.s3<<n;
191
+  res.s4=a.s4<<n;
192
+  return res;
193
+}
194
+
195
+batch static inline B_FFSH8R(batch a,int n){
196
+  batch res;
197
+  res.s1=a.s1>>n;
198
+  res.s2=a.s2>>n;
199
+  res.s3=a.s3>>n;
200
+  res.s4=a.s4>>n;
201
+  return res;
202
+}
203
+
204
+
205
+void static inline M_EMPTY(void){
206
+}
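
Note on the 32-wide table layout: the index arithmetic in `FFTABLEOUTXORNBY` is easier to follow when the byte offset is computed separately. This is a sketch under the assumption of a 4-byte `int`, using `memcpy` instead of pointer casts; the names `table_in`, `table_offset` and `table_out_xor` are hypothetical and are not part of FFdecsa.

```c
#include <assert.h>
#include <string.h>

enum { PAR = 32, WORD = 4 };   /* 32 slots per group, 4 bytes per table row */

/* Store the 8-byte block of slot g: first word in row g, second in row 32+g. */
static void table_in(unsigned char *tab, int g, const unsigned char *data)
{
    memcpy(tab + WORD * g, data, WORD);
    memcpy(tab + WORD * (PAR + g), data + WORD, WORD);
}

/* Byte offset inside tab of byte j (0..7) of slot g.
 * For j >= 4 the row is 32+g and the byte within it is j-4,
 * which folds into 4*(g+31)+j, the expression used in the header. */
static int table_offset(int g, int j)
{
    assert(g >= 0 && g < PAR && j >= 0 && j < 8);
    return WORD * (g + (j >= 4 ? PAR - 1 : 0)) + j;
}

/* XOR the first n bytes of slot g's block into data. */
static void table_out_xor(int n, unsigned char *data,
                          const unsigned char *tab, int g)
{
    for (int j = 0; j < n; j++)
        data[j] ^= tab[table_offset(g, j)];
}
```

XORing a stored block into a zeroed buffer returns the original eight bytes, which is a quick way to check the offsets.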

+ 171
- 0
FFdecsa/parallel_032_4charA.h

@@ -0,0 +1,171 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned char s1[4];
23
+};
24
+typedef struct group_t group;
25
+
26
+#define GROUP_PARALLELISM 32
27
+
28
+group static inline FF0(){
29
+  group res;
30
+  int i;
31
+  for(i=0;i<4;i++) res.s1[i]=0x0;
32
+  return res;
33
+}
34
+
35
+group static inline FF1(){
36
+  group res;
37
+  int i;
38
+  for(i=0;i<4;i++) res.s1[i]=0xff;
39
+  return res;
40
+}
41
+
42
+group static inline FFAND(group a,group b){
43
+  group res;
44
+  int i;
45
+  for(i=0;i<4;i++) res.s1[i]=a.s1[i]&b.s1[i];
46
+  return res;
47
+}
48
+
49
+group static inline FFOR(group a,group b){
50
+  group res;
51
+  int i;
52
+  for(i=0;i<4;i++) res.s1[i]=a.s1[i]|b.s1[i];
53
+  return res;
54
+}
55
+
56
+group static inline FFXOR(group a,group b){
57
+  group res;
58
+  int i;
59
+  for(i=0;i<4;i++) res.s1[i]=a.s1[i]^b.s1[i];
60
+  return res;
61
+}
62
+
63
+group static inline FFNOT(group a){
64
+  group res;
65
+  int i;
66
+  for(i=0;i<4;i++) res.s1[i]=~a.s1[i];
67
+  return res;
68
+}
69
+
70
+
71
+/* tab is 64 rows of 32 bits: rows 0..31 hold bytes 0..3 of each block, rows 32..63 hold bytes 4..7 (assumes 4-byte int) */
72
+
73
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
74
+  *(((int *)tab)+g)=*((int *)data);
75
+  *(((int *)tab)+32+g)=*(((int *)data)+1);
76
+}
77
+
78
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
79
+  *((int *)data)=*(((int *)tab)+g);
80
+  *(((int *)data)+1)=*(((int *)tab)+32+g);
81
+}
82
+
83
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
84
+  int j;
85
+  for(j=0;j<n;j++){
86
+    *(data+j)^=*(tab+4*(g+(j>=4?32-1:0))+j);
87
+  }
88
+}
89
+
90
+struct batch_t{
91
+  unsigned char s1[4];
92
+};
93
+typedef struct batch_t batch;
94
+
95
+#define BYTES_PER_BATCH 4
96
+
97
+batch static inline B_FFAND(batch a,batch b){
98
+  batch res;
99
+  int i;
100
+  for(i=0;i<4;i++) res.s1[i]=a.s1[i]&b.s1[i];
101
+  return res;
102
+}
103
+
104
+batch static inline B_FFOR(batch a,batch b){
105
+  batch res;
106
+  int i;
107
+  for(i=0;i<4;i++) res.s1[i]=a.s1[i]|b.s1[i];
108
+  return res;
109
+}
110
+
111
+batch static inline B_FFXOR(batch a,batch b){
112
+  batch res;
113
+  int i;
114
+  for(i=0;i<4;i++) res.s1[i]=a.s1[i]^b.s1[i];
115
+  return res;
116
+}
117
+
118
+
119
+batch static inline B_FFN_ALL_29(){
120
+  batch res;
121
+  int i;
122
+  for(i=0;i<4;i++) res.s1[i]=0x29;
123
+  return res;
124
+}
125
+batch static inline B_FFN_ALL_02(){
126
+  batch res;
127
+  int i;
128
+  for(i=0;i<4;i++) res.s1[i]=0x02;
129
+  return res;
130
+}
131
+batch static inline B_FFN_ALL_04(){
132
+  batch res;
133
+  int i;
134
+  for(i=0;i<4;i++) res.s1[i]=0x04;
135
+  return res;
136
+}
137
+batch static inline B_FFN_ALL_10(){
138
+  batch res;
139
+  int i;
140
+  for(i=0;i<4;i++) res.s1[i]=0x10;
141
+  return res;
142
+}
143
+batch static inline B_FFN_ALL_40(){
144
+  batch res;
145
+  int i;
146
+  for(i=0;i<4;i++) res.s1[i]=0x40;
147
+  return res;
148
+}
149
+batch static inline B_FFN_ALL_80(){
150
+  batch res;
151
+  int i;
152
+  for(i=0;i<4;i++) res.s1[i]=0x80;
153
+  return res;
154
+}
155
+
156
+batch static inline B_FFSH8L(batch a,int n){
157
+  batch res;
158
+  int i;
159
+  for(i=0;i<4;i++) res.s1[i]=a.s1[i]<<n;
160
+  return res;
161
+}
162
+
163
+batch static inline B_FFSH8R(batch a,int n){
164
+  batch res;
165
+  int i;
166
+  for(i=0;i<4;i++) res.s1[i]=a.s1[i]>>n;
167
+  return res;
168
+}
169
+
170
+void static inline M_EMPTY(void){
171
+}

+ 55
- 0
FFdecsa/parallel_032_int.h

@@ -0,0 +1,55 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+#include "parallel_std_def.h"
21
+
22
+typedef unsigned int group;
23
+#define GROUP_PARALLELISM 32
24
+#define FF0()      0x0
25
+#define FF1()      0xffffffff
26
+
27
+/* tab is 64 rows of 32 bits: rows 0..31 hold bytes 0..3 of each block, rows 32..63 hold bytes 4..7 (assumes 4-byte int) */
28
+
29
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
30
+  *(((int *)tab)+g)=*((int *)data);
31
+  *(((int *)tab)+32+g)=*(((int *)data)+1);
32
+}
33
+
34
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
35
+  *((int *)data)=*(((int *)tab)+g);
36
+  *(((int *)data)+1)=*(((int *)tab)+32+g);
37
+}
38
+
39
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
40
+  int j;
41
+  for(j=0;j<n;j++){
42
+    *(data+j)^=*(tab+4*(g+(j>=4?32-1:0))+j);
43
+  }
44
+}
45
+
46
+typedef unsigned int batch;
47
+#define BYTES_PER_BATCH 4
48
+#define B_FFN_ALL_29() 0x29292929
49
+#define B_FFN_ALL_02() 0x02020202
50
+#define B_FFN_ALL_04() 0x04040404
51
+#define B_FFN_ALL_10() 0x10101010
52
+#define B_FFN_ALL_40() 0x40404040
53
+#define B_FFN_ALL_80() 0x80808080
54
+
55
+#define M_EMPTY()
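
Note on the packed constants: `0x29292929` and its siblings are one byte replicated into all four lanes of a 32-bit batch. The sketch below shows how such a constant can be derived, and how a shift can be kept inside each byte lane with a mask. Both helpers (`broadcast8`, `shl8_lanes`) are hypothetical illustrations; the shift macros this header actually uses come from `parallel_std_def.h`, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Replicate one byte into all four lanes, e.g. 0x29 -> 0x29292929. */
static uint32_t broadcast8(uint8_t b)
{
    return (uint32_t)b * 0x01010101u;
}

/* Hypothetical helper: shift every byte lane left by n and mask off
 * the bits that crossed into the neighbouring lane. */
static uint32_t shl8_lanes(uint32_t x, int n)
{
    assert(n >= 0 && n < 8);
    return (x << n) & broadcast8((uint8_t)(0xFFu << n));
}
```

The struct-of-bytes variants get the same lane isolation for free, because each `unsigned char` member truncates on assignment.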

+ 175
- 0
FFdecsa/parallel_064_2int.h

@@ -0,0 +1,175 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned int s1;
23
+  unsigned int s2;
24
+};
25
+typedef struct group_t group;
26
+
27
+#define GROUP_PARALLELISM 64
28
+
29
+group static inline FF0(){
30
+  group res;
31
+  res.s1=0x0;
32
+  res.s2=0x0;
33
+  return res;
34
+}
35
+
36
+group static inline FF1(){
37
+  group res;
38
+  res.s1=0xffffffff;
39
+  res.s2=0xffffffff;
40
+  return res;
41
+}
42
+
43
+group static inline FFAND(group a,group b){
44
+  group res;
45
+  res.s1=a.s1&b.s1;
46
+  res.s2=a.s2&b.s2;
47
+  return res;
48
+}
49
+
50
+group static inline FFOR(group a,group b){
51
+  group res;
52
+  res.s1=a.s1|b.s1;
53
+  res.s2=a.s2|b.s2;
54
+  return res;
55
+}
56
+
57
+group static inline FFXOR(group a,group b){
58
+  group res;
59
+  res.s1=a.s1^b.s1;
60
+  res.s2=a.s2^b.s2;
61
+  return res;
62
+}
63
+
64
+group static inline FFNOT(group a){
65
+  group res;
66
+  res.s1=~a.s1;
67
+  res.s2=~a.s2;
68
+  return res;
69
+}
70
+
71
+
72
+/* tab is 64 rows of 64 bits: row g holds the whole 8-byte block g (assumes 4-byte int) */
73
+
74
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
75
+  *(((int *)tab)+2*g)=*((int *)data);
76
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
77
+}
78
+
79
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
80
+  *((int *)data)=*(((int *)tab)+2*g);
81
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
82
+}
83
+
84
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
85
+  int j;
86
+  for(j=0;j<n;j++){
87
+    *(data+j)^=*(tab+8*g+j);
88
+  }
89
+}
90
+
91
+struct batch_t{
92
+  unsigned int s1;
93
+  unsigned int s2;
94
+};
95
+typedef struct batch_t batch;
96
+
97
+#define BYTES_PER_BATCH 8
98
+
99
+batch static inline B_FFAND(batch a,batch b){
100
+  batch res;
101
+  res.s1=a.s1&b.s1;
102
+  res.s2=a.s2&b.s2;
103
+  return res;
104
+}
105
+
106
+batch static inline B_FFOR(batch a,batch b){
107
+  batch res;
108
+  res.s1=a.s1|b.s1;
109
+  res.s2=a.s2|b.s2;
110
+  return res;
111
+}
112
+
113
+batch static inline B_FFXOR(batch a,batch b){
114
+  batch res;
115
+  res.s1=a.s1^b.s1;
116
+  res.s2=a.s2^b.s2;
117
+  return res;
118
+}
119
+
120
+
121
+batch static inline B_FFN_ALL_29(){
122
+  batch res;
123
+  res.s1=0x29292929;
124
+  res.s2=0x29292929;
125
+  return res;
126
+}
127
+batch static inline B_FFN_ALL_02(){
128
+  batch res;
129
+  res.s1=0x02020202;
130
+  res.s2=0x02020202;
131
+  return res;
132
+}
133
+batch static inline B_FFN_ALL_04(){
134
+  batch res;
135
+  res.s1=0x04040404;
136
+  res.s2=0x04040404;
137
+  return res;
138
+}
139
+batch static inline B_FFN_ALL_10(){
140
+  batch res;
141
+  res.s1=0x10101010;
142
+  res.s2=0x10101010;
143
+  return res;
144
+}
145
+batch static inline B_FFN_ALL_40(){
146
+  batch res;
147
+  res.s1=0x40404040;
148
+  res.s2=0x40404040;
149
+  return res;
150
+}
151
+batch static inline B_FFN_ALL_80(){
152
+  batch res;
153
+  res.s1=0x80808080;
154
+  res.s2=0x80808080;
155
+  return res;
156
+}
157
+
158
+
159
+batch static inline B_FFSH8L(batch a,int n){
160
+  batch res;
161
+  res.s1=a.s1<<n;
162
+  res.s2=a.s2<<n;
163
+  return res;
164
+}
165
+
166
+batch static inline B_FFSH8R(batch a,int n){
167
+  batch res;
168
+  res.s1=a.s1>>n;
169
+  res.s2=a.s2>>n;
170
+  return res;
171
+}
172
+
173
+
174
+void static inline M_EMPTY(void){
175
+}
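
Note on the 64-wide table layout: with 64-bit rows, each 8-byte block is stored contiguously at byte offset `8*g`, so the split into two half-tables used by the 32-wide headers disappears. A minimal sketch of the same access pattern, written with `memcpy` rather than `int` casts; the names `table_in64` and `table_out_xor64` are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { ROW_BYTES = 8 };   /* one 64-bit row per slot */

/* Store the 8-byte block of slot g in row g. */
static void table_in64(unsigned char *tab, int g, const unsigned char *data)
{
    memcpy(tab + ROW_BYTES * g, data, ROW_BYTES);
}

/* XOR the first n bytes (n <= 8) of row g into data. */
static void table_out_xor64(int n, unsigned char *data,
                            const unsigned char *tab, int g)
{
    assert(n >= 0 && n <= ROW_BYTES);
    for (int j = 0; j < n; j++)
        data[j] ^= tab[ROW_BYTES * g + j];
}
```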

+ 274
- 0
FFdecsa/parallel_064_8char.h

@@ -0,0 +1,274 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned char s1,s2,s3,s4,s5,s6,s7,s8;
23
+};
24
+typedef struct group_t group;
25
+
26
+#define GROUP_PARALLELISM 64
27
+
28
+group static inline FF0(){
29
+  group res;
30
+  res.s1=0x0;
31
+  res.s2=0x0;
32
+  res.s3=0x0;
33
+  res.s4=0x0;
34
+  res.s5=0x0;
35
+  res.s6=0x0;
36
+  res.s7=0x0;
37
+  res.s8=0x0;
38
+  return res;
39
+}
40
+
41
+group static inline FF1(){
42
+  group res;
43
+  res.s1=0xff;
44
+  res.s2=0xff;
45
+  res.s3=0xff;
46
+  res.s4=0xff;
47
+  res.s5=0xff;
48
+  res.s6=0xff;
49
+  res.s7=0xff;
50
+  res.s8=0xff;
51
+  return res;
52
+}
53
+
54
+group static inline FFAND(group a,group b){
55
+  group res;
56
+  res.s1=a.s1&b.s1;
57
+  res.s2=a.s2&b.s2;
58
+  res.s3=a.s3&b.s3;
59
+  res.s4=a.s4&b.s4;
60
+  res.s5=a.s5&b.s5;
61
+  res.s6=a.s6&b.s6;
62
+  res.s7=a.s7&b.s7;
63
+  res.s8=a.s8&b.s8;
64
+  return res;
65
+}
66
+
67
+group static inline FFOR(group a,group b){
68
+  group res;
69
+  res.s1=a.s1|b.s1;
70
+  res.s2=a.s2|b.s2;
71
+  res.s3=a.s3|b.s3;
72
+  res.s4=a.s4|b.s4;
73
+  res.s5=a.s5|b.s5;
74
+  res.s6=a.s6|b.s6;
75
+  res.s7=a.s7|b.s7;
76
+  res.s8=a.s8|b.s8;
77
+  return res;
78
+}
79
+
80
+group static inline FFXOR(group a,group b){
81
+  group res;
82
+  res.s1=a.s1^b.s1;
83
+  res.s2=a.s2^b.s2;
84
+  res.s3=a.s3^b.s3;
85
+  res.s4=a.s4^b.s4;
86
+  res.s5=a.s5^b.s5;
87
+  res.s6=a.s6^b.s6;
88
+  res.s7=a.s7^b.s7;
89
+  res.s8=a.s8^b.s8;
90
+  return res;
91
+}
92
+
93
+group static inline FFNOT(group a){
94
+  group res;
95
+  res.s1=~a.s1;
96
+  res.s2=~a.s2;
97
+  res.s3=~a.s3;
98
+  res.s4=~a.s4;
99
+  res.s5=~a.s5;
100
+  res.s6=~a.s6;
101
+  res.s7=~a.s7;
102
+  res.s8=~a.s8;
103
+  return res;
104
+}
105
+
106
+
107
+/* tab is 64 rows of 64 bits: row g holds the whole 8-byte block g (assumes 4-byte int) */
108
+
109
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
110
+  *(((int *)tab)+2*g)=*((int *)data);
111
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
112
+}
113
+
114
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
115
+  *((int *)data)=*(((int *)tab)+2*g);
116
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
117
+}
118
+
119
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
120
+  int j;
121
+  for(j=0;j<n;j++){
122
+    *(data+j)^=*(tab+8*g+j);
123
+  }
124
+}
125
+
126
+struct batch_t{
127
+  unsigned char s1,s2,s3,s4,s5,s6,s7,s8;
128
+};
129
+typedef struct batch_t batch;
130
+
131
+#define BYTES_PER_BATCH 8
132
+
133
+batch static inline B_FFAND(batch a,batch b){
134
+  batch res;
135
+  res.s1=a.s1&b.s1;
136
+  res.s2=a.s2&b.s2;
137
+  res.s3=a.s3&b.s3;
138
+  res.s4=a.s4&b.s4;
139
+  res.s5=a.s5&b.s5;
140
+  res.s6=a.s6&b.s6;
141
+  res.s7=a.s7&b.s7;
142
+  res.s8=a.s8&b.s8;
143
+  return res;
144
+}
145
+
146
+batch static inline B_FFOR(batch a,batch b){
147
+  batch res;
148
+  res.s1=a.s1|b.s1;
149
+  res.s2=a.s2|b.s2;
150
+  res.s3=a.s3|b.s3;
151
+  res.s4=a.s4|b.s4;
152
+  res.s5=a.s5|b.s5;
153
+  res.s6=a.s6|b.s6;
154
+  res.s7=a.s7|b.s7;
155
+  res.s8=a.s8|b.s8;
156
+  return res;
157
+}
158
+
159
+batch static inline B_FFXOR(batch a,batch b){
160
+  batch res;
161
+  res.s1=a.s1^b.s1;
162
+  res.s2=a.s2^b.s2;
163
+  res.s3=a.s3^b.s3;
164
+  res.s4=a.s4^b.s4;
165
+  res.s5=a.s5^b.s5;
166
+  res.s6=a.s6^b.s6;
167
+  res.s7=a.s7^b.s7;
168
+  res.s8=a.s8^b.s8;
169
+  return res;
170
+}
171
+
172
+
173
+batch static inline B_FFN_ALL_29(){
174
+  batch res;
175
+  res.s1=0x29;
176
+  res.s2=0x29;
177
+  res.s3=0x29;
178
+  res.s4=0x29;
179
+  res.s5=0x29;
180
+  res.s6=0x29;
181
+  res.s7=0x29;
182
+  res.s8=0x29;
183
+  return res;
184
+}
185
+batch static inline B_FFN_ALL_02(){
186
+  batch res;
187
+  res.s1=0x02;
188
+  res.s2=0x02;
189
+  res.s3=0x02;
190
+  res.s4=0x02;
191
+  res.s5=0x02;
192
+  res.s6=0x02;
193
+  res.s7=0x02;
194
+  res.s8=0x02;
195
+  return res;
196
+}
197
+batch static inline B_FFN_ALL_04(){
198
+  batch res;
199
+  res.s1=0x04;
200
+  res.s2=0x04;
201
+  res.s3=0x04;
202
+  res.s4=0x04;
203
+  res.s5=0x04;
204
+  res.s6=0x04;
205
+  res.s7=0x04;
206
+  res.s8=0x04;
207
+  return res;
208
+}
209
+batch static inline B_FFN_ALL_10(){
210
+  batch res;
211
+  res.s1=0x10;
212
+  res.s2=0x10;
213
+  res.s3=0x10;
214
+  res.s4=0x10;
215
+  res.s5=0x10;
216
+  res.s6=0x10;
217
+  res.s7=0x10;
218
+  res.s8=0x10;
219
+  return res;
220
+}
221
+batch static inline B_FFN_ALL_40(){
222
+  batch res;
223
+  res.s1=0x40;
224
+  res.s2=0x40;
225
+  res.s3=0x40;
226
+  res.s4=0x40;
227
+  res.s5=0x40;
228
+  res.s6=0x40;
229
+  res.s7=0x40;
230
+  res.s8=0x40;
231
+  return res;
232
+}
233
+batch static inline B_FFN_ALL_80(){
234
+  batch res;
235
+  res.s1=0x80;
236
+  res.s2=0x80;
237
+  res.s3=0x80;
238
+  res.s4=0x80;
239
+  res.s5=0x80;
240
+  res.s6=0x80;
241
+  res.s7=0x80;
242
+  res.s8=0x80;
243
+  return res;
244
+}
245
+
246
+batch static inline B_FFSH8L(batch a,int n){
247
+  batch res;
248
+  res.s1=a.s1<<n;
249
+  res.s2=a.s2<<n;
250
+  res.s3=a.s3<<n;
251
+  res.s4=a.s4<<n;
252
+  res.s5=a.s5<<n;
253
+  res.s6=a.s6<<n;
254
+  res.s7=a.s7<<n;
255
+  res.s8=a.s8<<n;
256
+  return res;
257
+}
258
+
259
+batch static inline B_FFSH8R(batch a,int n){
260
+  batch res;
261
+  res.s1=a.s1>>n;
262
+  res.s2=a.s2>>n;
263
+  res.s3=a.s3>>n;
264
+  res.s4=a.s4>>n;
265
+  res.s5=a.s5>>n;
266
+  res.s6=a.s6>>n;
267
+  res.s7=a.s7>>n;
268
+  res.s8=a.s8>>n;
269
+  return res;
270
+}
271
+
272
+
273
+void static inline M_EMPTY(void){
274
+}

+ 171
- 0
FFdecsa/parallel_064_8charA.h

@@ -0,0 +1,171 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned char s1[8];
23
+};
24
+typedef struct group_t group;
25
+
26
+#define GROUP_PARALLELISM 64
27
+
28
+group static inline FF0(){
29
+  group res;
30
+  int i;
31
+  for(i=0;i<8;i++) res.s1[i]=0x0;
32
+  return res;
33
+}
34
+
35
+group static inline FF1(){
36
+  group res;
37
+  int i;
38
+  for(i=0;i<8;i++) res.s1[i]=0xff;
39
+  return res;
40
+}
41
+
42
+group static inline FFAND(group a,group b){
43
+  group res;
44
+  int i;
45
+  for(i=0;i<8;i++) res.s1[i]=a.s1[i]&b.s1[i];
46
+  return res;
47
+}
48
+
49
+group static inline FFOR(group a,group b){
50
+  group res;
51
+  int i;
52
+  for(i=0;i<8;i++) res.s1[i]=a.s1[i]|b.s1[i];
53
+  return res;
54
+}
55
+
56
+group static inline FFXOR(group a,group b){
57
+  group res;
58
+  int i;
59
+  for(i=0;i<8;i++) res.s1[i]=a.s1[i]^b.s1[i];
60
+  return res;
61
+}
62
+
63
+group static inline FFNOT(group a){
64
+  group res;
65
+  int i;
66
+  for(i=0;i<8;i++) res.s1[i]=~a.s1[i];
67
+  return res;
68
+}
69
+
70
+
71
+/* tab is 64 rows of 64 bits: row g holds the whole 8-byte block g (assumes 4-byte int) */
72
+
73
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
74
+  *(((int *)tab)+2*g)=*((int *)data);
75
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
76
+}
77
+
78
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
79
+  *((int *)data)=*(((int *)tab)+2*g);
80
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
81
+}
82
+
83
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
84
+  int j;
85
+  for(j=0;j<n;j++){
86
+    *(data+j)^=*(tab+8*g+j);
87
+  }
88
+}
89
+
90
+struct batch_t{
91
+  unsigned char s1[8];
92
+};
93
+typedef struct batch_t batch;
94
+
95
+#define BYTES_PER_BATCH 8
96
+
97
+batch static inline B_FFAND(batch a,batch b){
98
+  batch res;
99
+  int i;
100
+  for(i=0;i<8;i++) res.s1[i]=a.s1[i]&b.s1[i];
101
+  return res;
102
+}
103
+
104
+batch static inline B_FFOR(batch a,batch b){
105
+  batch res;
106
+  int i;
107
+  for(i=0;i<8;i++) res.s1[i]=a.s1[i]|b.s1[i];
108
+  return res;
109
+}
110
+
111
+batch static inline B_FFXOR(batch a,batch b){
112
+  batch res;
113
+  int i;
114
+  for(i=0;i<8;i++) res.s1[i]=a.s1[i]^b.s1[i];
115
+  return res;
116
+}
117
+
118
+
119
+batch static inline B_FFN_ALL_29(){
120
+  batch res;
121
+  int i;
122
+  for(i=0;i<8;i++) res.s1[i]=0x29;
123
+  return res;
124
+}
125
+batch static inline B_FFN_ALL_02(){
126
+  batch res;
127
+  int i;
128
+  for(i=0;i<8;i++) res.s1[i]=0x02;
129
+  return res;
130
+}
131
+batch static inline B_FFN_ALL_04(){
132
+  batch res;
133
+  int i;
134
+  for(i=0;i<8;i++) res.s1[i]=0x04;
135
+  return res;
136
+}
137
+batch static inline B_FFN_ALL_10(){
138
+  batch res;
139
+  int i;
140
+  for(i=0;i<8;i++) res.s1[i]=0x10;
141
+  return res;
142
+}
143
+batch static inline B_FFN_ALL_40(){
144
+  batch res;
145
+  int i;
146
+  for(i=0;i<8;i++) res.s1[i]=0x40;
147
+  return res;
148
+}
149
+batch static inline B_FFN_ALL_80(){
150
+  batch res;
151
+  int i;
152
+  for(i=0;i<8;i++) res.s1[i]=0x80;
153
+  return res;
154
+}
155
+
156
+batch static inline B_FFSH8L(batch a,int n){
157
+  batch res;
158
+  int i;
159
+  for(i=0;i<8;i++) res.s1[i]=a.s1[i]<<n;
160
+  return res;
161
+}
162
+
163
+batch static inline B_FFSH8R(batch a,int n){
164
+  batch res;
165
+  int i;
166
+  for(i=0;i<8;i++) res.s1[i]=a.s1[i]>>n;
167
+  return res;
168
+}
169
+
170
+void static inline M_EMPTY(void){
171
+}
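
The byte-array variant above treats 8 bytes as 64 independent bit lanes and applies each logical operation one byte at a time. The following is a minimal standalone sketch of that idea, not part of the commit; the names `lanes64`, `lanes_fill`, `lanes_xor`, `lanes_not` and `lanes_equal` are hypothetical and chosen only for illustration.

```c
#include <string.h>

/* Hypothetical stand-in for the byte-array `group` type: 8 bytes = 64 bit lanes. */
typedef struct { unsigned char b[8]; } lanes64;

/* Set every byte to the same value (0x00 and 0xff mirror FF0 and FF1). */
static lanes64 lanes_fill(unsigned char v) {
    lanes64 r;
    memset(r.b, v, sizeof r.b);
    return r;
}

/* Lane-wise XOR, one byte at a time. */
static lanes64 lanes_xor(lanes64 x, lanes64 y) {
    lanes64 r;
    for (int i = 0; i < 8; i++) r.b[i] = (unsigned char)(x.b[i] ^ y.b[i]);
    return r;
}

/* Lane-wise NOT. The ~ operator promotes to int, so the cast truncates
   the result back to a single byte. */
static lanes64 lanes_not(lanes64 x) {
    lanes64 r;
    for (int i = 0; i < 8; i++) r.b[i] = (unsigned char)~x.b[i];
    return r;
}

/* Compare all 64 lanes at once. */
static int lanes_equal(lanes64 x, lanes64 y) {
    return memcmp(x.b, y.b, sizeof x.b) == 0;
}
```

With these helpers, inverting the all-zero value yields the all-ones value, and XOR-ing any value with itself yields zero, which is the behaviour a wider word or SIMD implementation of the same interface has to reproduce.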

+ 39
- 0
FFdecsa/parallel_064_long.h

@@ -0,0 +1,39 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2007 Dark Avenger
4
+ *               2003-2004  fatih89r
5
+ *
6
+ * This program is free software; you can redistribute it and/or modify
7
+ * it under the terms of the GNU General Public License as published by
8
+ * the Free Software Foundation; either version 2 of the License, or
9
+ * (at your option) any later version.
10
+ *
11
+ * This program is distributed in the hope that it will be useful,
12
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
14
+ * GNU General Public License for more details.
15
+ *
16
+ * You should have received a copy of the GNU General Public License
17
+ * along with this program; if not, write to the Free Software
18
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
19
+ */
20
+
21
+#include "parallel_std_def.h"
22
+
23
+typedef unsigned long long group;
24
+#define GROUP_PARALLELISM 64
25
+#define FF0() 0x0ULL
26
+#define FF1() 0xffffffffffffffffULL
27
+
28
+typedef unsigned long long batch;
29
+#define BYTES_PER_BATCH 8
30
+#define B_FFN_ALL_29() 0x2929292929292929ULL
31
+#define B_FFN_ALL_02() 0x0202020202020202ULL
32
+#define B_FFN_ALL_04() 0x0404040404040404ULL
33
+#define B_FFN_ALL_10() 0x1010101010101010ULL
34
+#define B_FFN_ALL_40() 0x4040404040404040ULL
35
+#define B_FFN_ALL_80() 0x8080808080808080ULL
36
+
37
+#define M_EMPTY()
38
+
39
+#include "fftable.h"

+ 83
- 0
FFdecsa/parallel_064_mmx.h

@@ -0,0 +1,83 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2007 Dark Avenger
4
+ *               2003-2004  fatih89r
5
+ *
6
+ * This program is free software; you can redistribute it and/or modify
7
+ * it under the terms of the GNU General Public License as published by
8
+ * the Free Software Foundation; either version 2 of the License, or
9
+ * (at your option) any later version.
10
+ *
11
+ * This program is distributed in the hope that it will be useful,
12
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
14
+ * GNU General Public License for more details.
15
+ *
16
+ * You should have received a copy of the GNU General Public License
17
+ * along with this program; if not, write to the Free Software
18
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
19
+ */
20
+
21
+#include <mmintrin.h>
22
+
23
+#define MEMALIGN __attribute__((aligned(16)))
24
+
25
+union __u64 {
26
+    unsigned int u[2];
27
+    __m64 v;
28
+};
29
+
30
+static const union __u64 ff0 = {{0x00000000U, 0x00000000U}};
31
+static const union __u64 ff1 = {{0xffffffffU, 0xffffffffU}};
32
+
33
+typedef __m64 group;
34
+#define GROUP_PARALLELISM 64
35
+#define FF0()      ff0.v
36
+#define FF1()      ff1.v
37
+#define FFAND(a,b) _mm_and_si64((a),(b))
38
+#define FFOR(a,b)  _mm_or_si64((a),(b))
39
+#define FFXOR(a,b) _mm_xor_si64((a),(b))
40
+#define FFNOT(a)   _mm_xor_si64((a),FF1())
41
+
42
+/* 64 rows of 64 bits */
43
+
44
+static const union __u64 ff29 = {{0x29292929U, 0x29292929U}};
45
+static const union __u64 ff02 = {{0x02020202U, 0x02020202U}};
46
+static const union __u64 ff04 = {{0x04040404U, 0x04040404U}};
47
+static const union __u64 ff10 = {{0x10101010U, 0x10101010U}};
48
+static const union __u64 ff40 = {{0x40404040U, 0x40404040U}};
49
+static const union __u64 ff80 = {{0x80808080U, 0x80808080U}};
50
+
51
+typedef __m64 batch;
52
+#define BYTES_PER_BATCH 8
53
+#define B_FFAND(a,b) FFAND((a),(b))
54
+#define B_FFOR(a,b)  FFOR((a),(b))
55
+#define B_FFXOR(a,b) FFXOR((a),(b))
56
+#define B_FFN_ALL_29() ff29.v
57
+#define B_FFN_ALL_02() ff02.v
58
+#define B_FFN_ALL_04() ff04.v
59
+#define B_FFN_ALL_10() ff10.v
60
+#define B_FFN_ALL_40() ff40.v
61
+#define B_FFN_ALL_80() ff80.v
62
+#define B_FFSH8L(a,n) _mm_slli_si64((a),(n))
63
+#define B_FFSH8R(a,n) _mm_srli_si64((a),(n))
64
+
65
+#define M_EMPTY() _mm_empty()
66
+
67
+
68
+#undef XOR_8_BY
69
+#define XOR_8_BY(d,s1,s2)    do { *(__m64 *)(d) = _mm_xor_si64(*(__m64 *)(s1), *(__m64 *)(s2)); } while(0)
70
+
71
+#undef XOREQ_8_BY
72
+#define XOREQ_8_BY(d,s)      XOR_8_BY(d, d, s)
73
+
74
+#undef COPY_8_BY
75
+#define COPY_8_BY(d,s)       do { *(__m64 *)(d) = *(__m64 *)(s); } while(0)
76
+
77
+#undef BEST_SPAN
78
+#define BEST_SPAN            8
79
+
80
+#undef XOR_BEST_BY
81
+#define XOR_BEST_BY(d,s1,s2) XOR_8_BY(d,s1,s2)
82
+
83
+#include "fftable.h"

+ 411
- 0
FFdecsa/parallel_128_16char.h

@@ -0,0 +1,411 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned char s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16;
23
+};
24
+typedef struct group_t group;
25
+
26
+#define GROUP_PARALLELISM 128
27
+
28
+group static inline FF0(){
29
+  group res;
30
+  res.s1=0x0;
31
+  res.s2=0x0;
32
+  res.s3=0x0;
33
+  res.s4=0x0;
34
+  res.s5=0x0;
35
+  res.s6=0x0;
36
+  res.s7=0x0;
37
+  res.s8=0x0;
38
+  res.s9=0x0;
39
+  res.s10=0x0;
40
+  res.s11=0x0;
41
+  res.s12=0x0;
42
+  res.s13=0x0;
43
+  res.s14=0x0;
44
+  res.s15=0x0;
45
+  res.s16=0x0;
46
+  return res;
47
+}
48
+
49
+group static inline FF1(){
50
+  group res;
51
+  res.s1=0xff;
52
+  res.s2=0xff;
53
+  res.s3=0xff;
54
+  res.s4=0xff;
55
+  res.s5=0xff;
56
+  res.s6=0xff;
57
+  res.s7=0xff;
58
+  res.s8=0xff;
59
+  res.s9=0xff;
60
+  res.s10=0xff;
61
+  res.s11=0xff;
62
+  res.s12=0xff;
63
+  res.s13=0xff;
64
+  res.s14=0xff;
65
+  res.s15=0xff;
66
+  res.s16=0xff;
67
+  return res;
68
+}
69
+
70
+group static inline FFAND(group a,group b){
71
+  group res;
72
+  res.s1=a.s1&b.s1;
73
+  res.s2=a.s2&b.s2;
74
+  res.s3=a.s3&b.s3;
75
+  res.s4=a.s4&b.s4;
76
+  res.s5=a.s5&b.s5;
77
+  res.s6=a.s6&b.s6;
78
+  res.s7=a.s7&b.s7;
79
+  res.s8=a.s8&b.s8;
80
+  res.s9=a.s9&b.s9;
81
+  res.s10=a.s10&b.s10;
82
+  res.s11=a.s11&b.s11;
83
+  res.s12=a.s12&b.s12;
84
+  res.s13=a.s13&b.s13;
85
+  res.s14=a.s14&b.s14;
86
+  res.s15=a.s15&b.s15;
87
+  res.s16=a.s16&b.s16;
88
+  return res;
89
+}
90
+
91
+group static inline FFOR(group a,group b){
92
+  group res;
93
+  res.s1=a.s1|b.s1;
94
+  res.s2=a.s2|b.s2;
95
+  res.s3=a.s3|b.s3;
96
+  res.s4=a.s4|b.s4;
97
+  res.s5=a.s5|b.s5;
98
+  res.s6=a.s6|b.s6;
99
+  res.s7=a.s7|b.s7;
100
+  res.s8=a.s8|b.s8;
101
+  res.s9=a.s9|b.s9;
102
+  res.s10=a.s10|b.s10;
103
+  res.s11=a.s11|b.s11;
104
+  res.s12=a.s12|b.s12;
105
+  res.s13=a.s13|b.s13;
106
+  res.s14=a.s14|b.s14;
107
+  res.s15=a.s15|b.s15;
108
+  res.s16=a.s16|b.s16;
109
+  return res;
110
+}
111
+
112
+group static inline FFXOR(group a,group b){
113
+  group res;
114
+  res.s1=a.s1^b.s1;
115
+  res.s2=a.s2^b.s2;
116
+  res.s3=a.s3^b.s3;
117
+  res.s4=a.s4^b.s4;
118
+  res.s5=a.s5^b.s5;
119
+  res.s6=a.s6^b.s6;
120
+  res.s7=a.s7^b.s7;
121
+  res.s8=a.s8^b.s8;
122
+  res.s9=a.s9^b.s9;
123
+  res.s10=a.s10^b.s10;
124
+  res.s11=a.s11^b.s11;
125
+  res.s12=a.s12^b.s12;
126
+  res.s13=a.s13^b.s13;
127
+  res.s14=a.s14^b.s14;
128
+  res.s15=a.s15^b.s15;
129
+  res.s16=a.s16^b.s16;
130
+  return res;
131
+}
132
+
133
+group static inline FFNOT(group a){
134
+  group res;
135
+  res.s1=~a.s1;
136
+  res.s2=~a.s2;
137
+  res.s3=~a.s3;
138
+  res.s4=~a.s4;
139
+  res.s5=~a.s5;
140
+  res.s6=~a.s6;
141
+  res.s7=~a.s7;
142
+  res.s8=~a.s8;
143
+  res.s9=~a.s9;
144
+  res.s10=~a.s10;
145
+  res.s11=~a.s11;
146
+  res.s12=~a.s12;
147
+  res.s13=~a.s13;
148
+  res.s14=~a.s14;
149
+  res.s15=~a.s15;
150
+  res.s16=~a.s16;
151
+  return res;
152
+}
153
+
154
+
155
+/* 64 rows of 128 bits */
156
+
157
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
158
+  *(((int *)tab)+2*g)=*((int *)data);
159
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
160
+}
161
+
162
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
163
+  *((int *)data)=*(((int *)tab)+2*g);
164
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
165
+}
166
+
167
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
168
+  int j;
169
+  for(j=0;j<n;j++){
170
+    *(data+j)^=*(tab+8*g+j);
171
+  }
172
+}
173
+
174
+
175
+struct batch_t{
176
+  unsigned char s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16;
177
+};
178
+typedef struct batch_t batch;
179
+
180
+#define BYTES_PER_BATCH 16
181
+
182
+batch static inline B_FFAND(batch a,batch b){
183
+  batch res;
184
+  res.s1=a.s1&b.s1;
185
+  res.s2=a.s2&b.s2;
186
+  res.s3=a.s3&b.s3;
187
+  res.s4=a.s4&b.s4;
188
+  res.s5=a.s5&b.s5;
189
+  res.s6=a.s6&b.s6;
190
+  res.s7=a.s7&b.s7;
191
+  res.s8=a.s8&b.s8;
192
+  res.s9=a.s9&b.s9;
193
+  res.s10=a.s10&b.s10;
194
+  res.s11=a.s11&b.s11;
195
+  res.s12=a.s12&b.s12;
196
+  res.s13=a.s13&b.s13;
197
+  res.s14=a.s14&b.s14;
198
+  res.s15=a.s15&b.s15;
199
+  res.s16=a.s16&b.s16;
200
+  return res;
201
+}
202
+
203
+batch static inline B_FFOR(batch a,batch b){
204
+  batch res;
205
+  res.s1=a.s1|b.s1;
206
+  res.s2=a.s2|b.s2;
207
+  res.s3=a.s3|b.s3;
208
+  res.s4=a.s4|b.s4;
209
+  res.s5=a.s5|b.s5;
210
+  res.s6=a.s6|b.s6;
211
+  res.s7=a.s7|b.s7;
212
+  res.s8=a.s8|b.s8;
213
+  res.s9=a.s9|b.s9;
214
+  res.s10=a.s10|b.s10;
215
+  res.s11=a.s11|b.s11;
216
+  res.s12=a.s12|b.s12;
217
+  res.s13=a.s13|b.s13;
218
+  res.s14=a.s14|b.s14;
219
+  res.s15=a.s15|b.s15;
220
+  res.s16=a.s16|b.s16;
221
+  return res;
222
+}
223
+
224
+batch static inline B_FFXOR(batch a,batch b){
225
+  batch res;
226
+  res.s1=a.s1^b.s1;
227
+  res.s2=a.s2^b.s2;
228
+  res.s3=a.s3^b.s3;
229
+  res.s4=a.s4^b.s4;
230
+  res.s5=a.s5^b.s5;
231
+  res.s6=a.s6^b.s6;
232
+  res.s7=a.s7^b.s7;
233
+  res.s8=a.s8^b.s8;
234
+  res.s9=a.s9^b.s9;
235
+  res.s10=a.s10^b.s10;
236
+  res.s11=a.s11^b.s11;
237
+  res.s12=a.s12^b.s12;
238
+  res.s13=a.s13^b.s13;
239
+  res.s14=a.s14^b.s14;
240
+  res.s15=a.s15^b.s15;
241
+  res.s16=a.s16^b.s16;
242
+  return res;
243
+}
244
+
245
+
246
+batch static inline B_FFN_ALL_29(){
247
+  batch res;
248
+  res.s1=0x29;
249
+  res.s2=0x29;
250
+  res.s3=0x29;
251
+  res.s4=0x29;
252
+  res.s5=0x29;
253
+  res.s6=0x29;
254
+  res.s7=0x29;
255
+  res.s8=0x29;
256
+  res.s9=0x29;
257
+  res.s10=0x29;
258
+  res.s11=0x29;
259
+  res.s12=0x29;
260
+  res.s13=0x29;
261
+  res.s14=0x29;
262
+  res.s15=0x29;
263
+  res.s16=0x29;
264
+  return res;
265
+}
266
+batch static inline B_FFN_ALL_02(){
267
+  batch res;
268
+  res.s1=0x02;
269
+  res.s2=0x02;
270
+  res.s3=0x02;
271
+  res.s4=0x02;
272
+  res.s5=0x02;
273
+  res.s6=0x02;
274
+  res.s7=0x02;
275
+  res.s8=0x02;
276
+  res.s9=0x02;
277
+  res.s10=0x02;
278
+  res.s11=0x02;
279
+  res.s12=0x02;
280
+  res.s13=0x02;
281
+  res.s14=0x02;
282
+  res.s15=0x02;
283
+  res.s16=0x02;
284
+  return res;
285
+}
286
+batch static inline B_FFN_ALL_04(){
287
+  batch res;
288
+  res.s1=0x04;
289
+  res.s2=0x04;
290
+  res.s3=0x04;
291
+  res.s4=0x04;
292
+  res.s5=0x04;
293
+  res.s6=0x04;
294
+  res.s7=0x04;
295
+  res.s8=0x04;
296
+  res.s9=0x04;
297
+  res.s10=0x04;
298
+  res.s11=0x04;
299
+  res.s12=0x04;
300
+  res.s13=0x04;
301
+  res.s14=0x04;
302
+  res.s15=0x04;
303
+  res.s16=0x04;
304
+  return res;
305
+}
306
+batch static inline B_FFN_ALL_10(){
307
+  batch res;
308
+  res.s1=0x10;
309
+  res.s2=0x10;
310
+  res.s3=0x10;
311
+  res.s4=0x10;
312
+  res.s5=0x10;
313
+  res.s6=0x10;
314
+  res.s7=0x10;
315
+  res.s8=0x10;
316
+  res.s9=0x10;
317
+  res.s10=0x10;
318
+  res.s11=0x10;
319
+  res.s12=0x10;
320
+  res.s13=0x10;
321
+  res.s14=0x10;
322
+  res.s15=0x10;
323
+  res.s16=0x10;
324
+  return res;
325
+}
326
+batch static inline B_FFN_ALL_40(){
327
+  batch res;
328
+  res.s1=0x40;
329
+  res.s2=0x40;
330
+  res.s3=0x40;
331
+  res.s4=0x40;
332
+  res.s5=0x40;
333
+  res.s6=0x40;
334
+  res.s7=0x40;
335
+  res.s8=0x40;
336
+  res.s9=0x40;
337
+  res.s10=0x40;
338
+  res.s11=0x40;
339
+  res.s12=0x40;
340
+  res.s13=0x40;
341
+  res.s14=0x40;
342
+  res.s15=0x40;
343
+  res.s16=0x40;
344
+  return res;
345
+}
346
+batch static inline B_FFN_ALL_80(){
347
+  batch res;
348
+  res.s1=0x80;
349
+  res.s2=0x80;
350
+  res.s3=0x80;
351
+  res.s4=0x80;
352
+  res.s5=0x80;
353
+  res.s6=0x80;
354
+  res.s7=0x80;
355
+  res.s8=0x80;
356
+  res.s9=0x80;
357
+  res.s10=0x80;
358
+  res.s11=0x80;
359
+  res.s12=0x80;
360
+  res.s13=0x80;
361
+  res.s14=0x80;
362
+  res.s15=0x80;
363
+  res.s16=0x80;
364
+  return res;
365
+}
366
+
367
+batch static inline B_FFSH8L(batch a,int n){
368
+  batch res;
369
+  res.s1=a.s1<<n;
370
+  res.s2=a.s2<<n;
371
+  res.s3=a.s3<<n;
372
+  res.s4=a.s4<<n;
373
+  res.s5=a.s5<<n;
374
+  res.s6=a.s6<<n;
375
+  res.s7=a.s7<<n;
376
+  res.s8=a.s8<<n;
377
+  res.s9=a.s9<<n;
378
+  res.s10=a.s10<<n;
379
+  res.s11=a.s11<<n;
380
+  res.s12=a.s12<<n;
381
+  res.s13=a.s13<<n;
382
+  res.s14=a.s14<<n;
383
+  res.s15=a.s15<<n;
384
+  res.s16=a.s16<<n;
385
+  return res;
386
+}
387
+
388
+batch static inline B_FFSH8R(batch a,int n){
389
+  batch res;
390
+  res.s1=a.s1>>n;
391
+  res.s2=a.s2>>n;
392
+  res.s3=a.s3>>n;
393
+  res.s4=a.s4>>n;
394
+  res.s5=a.s5>>n;
395
+  res.s6=a.s6>>n;
396
+  res.s7=a.s7>>n;
397
+  res.s8=a.s8>>n;
398
+  res.s9=a.s9>>n;
399
+  res.s10=a.s10>>n;
400
+  res.s11=a.s11>>n;
401
+  res.s12=a.s12>>n;
402
+  res.s13=a.s13>>n;
403
+  res.s14=a.s14>>n;
404
+  res.s15=a.s15>>n;
405
+  res.s16=a.s16>>n;
406
+  return res;
407
+}
408
+
409
+
410
+void static inline M_EMPTY(void){
411
+}

+ 172
- 0
FFdecsa/parallel_128_16charA.h

@@ -0,0 +1,172 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned char s1[16];
23
+};
24
+typedef struct group_t group;
25
+
26
+#define GROUP_PARALLELISM 128
27
+
28
+group static inline FF0(){
29
+  group res;
30
+  int i;
31
+  for(i=0;i<16;i++) res.s1[i]=0x0;
32
+  return res;
33
+}
34
+
35
+group static inline FF1(){
36
+  group res;
37
+  int i;
38
+  for(i=0;i<16;i++) res.s1[i]=0xff;
39
+  return res;
40
+}
41
+
42
+group static inline FFAND(group a,group b){
43
+  group res;
44
+  int i;
45
+  for(i=0;i<16;i++) res.s1[i]=a.s1[i]&b.s1[i];
46
+  return res;
47
+}
48
+
49
+group static inline FFOR(group a,group b){
50
+  group res;
51
+  int i;
52
+  for(i=0;i<16;i++) res.s1[i]=a.s1[i]|b.s1[i];
53
+  return res;
54
+}
55
+
56
+group static inline FFXOR(group a,group b){
57
+  group res;
58
+  int i;
59
+  for(i=0;i<16;i++) res.s1[i]=a.s1[i]^b.s1[i];
60
+  return res;
61
+}
62
+
63
+group static inline FFNOT(group a){
64
+  group res;
65
+  int i;
66
+  for(i=0;i<16;i++) res.s1[i]=~a.s1[i];
67
+  return res;
68
+}
69
+
70
+
71
+/* 64 rows of 128 bits */
72
+
73
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
74
+  *(((int *)tab)+2*g)=*((int *)data);
75
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
76
+}
77
+
78
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
79
+  *((int *)data)=*(((int *)tab)+2*g);
80
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
81
+}
82
+
83
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
84
+  int j;
85
+  for(j=0;j<n;j++){
86
+    *(data+j)^=*(tab+8*g+j);
87
+  }
88
+}
89
+
90
+
91
+struct batch_t{
92
+  unsigned char s1[16];
93
+};
94
+typedef struct batch_t batch;
95
+
96
+#define BYTES_PER_BATCH 16
97
+
98
+batch static inline B_FFAND(batch a,batch b){
99
+  batch res;
100
+  int i;
101
+  for(i=0;i<16;i++) res.s1[i]=a.s1[i]&b.s1[i];
102
+  return res;
103
+}
104
+
105
+batch static inline B_FFOR(batch a,batch b){
106
+  batch res;
107
+  int i;
108
+  for(i=0;i<16;i++) res.s1[i]=a.s1[i]|b.s1[i];
109
+  return res;
110
+}
111
+
112
+batch static inline B_FFXOR(batch a,batch b){
113
+  batch res;
114
+  int i;
115
+  for(i=0;i<16;i++) res.s1[i]=a.s1[i]^b.s1[i];
116
+  return res;
117
+}
118
+
119
+
120
+batch static inline B_FFN_ALL_29(){
121
+  batch res;
122
+  int i;
123
+  for(i=0;i<16;i++) res.s1[i]=0x29;
124
+  return res;
125
+}
126
+batch static inline B_FFN_ALL_02(){
127
+  batch res;
128
+  int i;
129
+  for(i=0;i<16;i++) res.s1[i]=0x02;
130
+  return res;
131
+}
132
+batch static inline B_FFN_ALL_04(){
133
+  batch res;
134
+  int i;
135
+  for(i=0;i<16;i++) res.s1[i]=0x04;
136
+  return res;
137
+}
138
+batch static inline B_FFN_ALL_10(){
139
+  batch res;
140
+  int i;
141
+  for(i=0;i<16;i++) res.s1[i]=0x10;
142
+  return res;
143
+}
144
+batch static inline B_FFN_ALL_40(){
145
+  batch res;
146
+  int i;
147
+  for(i=0;i<16;i++) res.s1[i]=0x40;
148
+  return res;
149
+}
150
+batch static inline B_FFN_ALL_80(){
151
+  batch res;
152
+  int i;
153
+  for(i=0;i<16;i++) res.s1[i]=0x80;
154
+  return res;
155
+}
156
+
157
+batch static inline B_FFSH8L(batch a,int n){
158
+  batch res;
159
+  int i;
160
+  for(i=0;i<16;i++) res.s1[i]=a.s1[i]<<n;
161
+  return res;
162
+}
163
+
164
+batch static inline B_FFSH8R(batch a,int n){
165
+  batch res;
166
+  int i;
167
+  for(i=0;i<16;i++) res.s1[i]=a.s1[i]>>n;
168
+  return res;
169
+}
170
+
171
+void static inline M_EMPTY(void){
172
+}

+ 175
- 0
FFdecsa/parallel_128_2long.h

@@ -0,0 +1,175 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned long long int s1;
23
+  unsigned long long int s2;
24
+};
25
+typedef struct group_t group;
26
+
27
+#define GROUP_PARALLELISM 128
28
+
29
+group static inline FF0(){
30
+  group res;
31
+  res.s1=0x0ULL;
32
+  res.s2=0x0ULL;
33
+  return res;
34
+}
35
+
36
+group static inline FF1(){
37
+  group res;
38
+  res.s1=0xffffffffffffffffULL;
39
+  res.s2=0xffffffffffffffffULL;
40
+  return res;
41
+}
42
+
43
+group static inline FFAND(group a,group b){
44
+  group res;
45
+  res.s1=a.s1&b.s1;
46
+  res.s2=a.s2&b.s2;
47
+  return res;
48
+}
49
+
50
+group static inline FFOR(group a,group b){
51
+  group res;
52
+  res.s1=a.s1|b.s1;
53
+  res.s2=a.s2|b.s2;
54
+  return res;
55
+}
56
+
57
+group static inline FFXOR(group a,group b){
58
+  group res;
59
+  res.s1=a.s1^b.s1;
60
+  res.s2=a.s2^b.s2;
61
+  return res;
62
+}
63
+
64
+group static inline FFNOT(group a){
65
+  group res;
66
+  res.s1=~a.s1;
67
+  res.s2=~a.s2;
68
+  return res;
69
+}
70
+
71
+
72
+/* 64 rows of 128 bits */
73
+
74
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
75
+  *(((int *)tab)+2*g)=*((int *)data);
76
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
77
+}
78
+
79
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
80
+  *((int *)data)=*(((int *)tab)+2*g);
81
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
82
+}
83
+
84
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
85
+  int j;
86
+  for(j=0;j<n;j++){
87
+    *(data+j)^=*(tab+8*g+j);
88
+  }
89
+}
90
+
91
+
92
+struct batch_t{
93
+  unsigned long long int s1;
94
+  unsigned long long int s2;
95
+};
96
+typedef struct batch_t batch;
97
+
98
+#define BYTES_PER_BATCH 16
99
+
100
+batch static inline B_FFAND(batch a,batch b){
101
+  batch res;
102
+  res.s1=a.s1&b.s1;
103
+  res.s2=a.s2&b.s2;
104
+  return res;
105
+}
106
+
107
+batch static inline B_FFOR(batch a,batch b){
108
+  batch res;
109
+  res.s1=a.s1|b.s1;
110
+  res.s2=a.s2|b.s2;
111
+  return res;
112
+}
113
+
114
+batch static inline B_FFXOR(batch a,batch b){
115
+  batch res;
116
+  res.s1=a.s1^b.s1;
117
+  res.s2=a.s2^b.s2;
118
+  return res;
119
+}
120
+
121
+
122
+batch static inline B_FFN_ALL_29(){
123
+  batch res;
124
+  res.s1=0x2929292929292929ULL;
125
+  res.s2=0x2929292929292929ULL;
126
+  return res;
127
+}
128
+
129
+batch static inline B_FFN_ALL_02(){
130
+  batch res;
131
+  res.s1=0x0202020202020202ULL;
132
+  res.s2=0x0202020202020202ULL;
133
+  return res;
134
+}
135
+batch static inline B_FFN_ALL_04(){
136
+  batch res;
137
+  res.s1=0x0404040404040404ULL;
138
+  res.s2=0x0404040404040404ULL;
139
+  return res;
140
+}
141
+batch static inline B_FFN_ALL_10(){
142
+  batch res;
143
+  res.s1=0x1010101010101010ULL;
144
+  res.s2=0x1010101010101010ULL;
145
+  return res;
146
+}
147
+batch static inline B_FFN_ALL_40(){
148
+  batch res;
149
+  res.s1=0x4040404040404040ULL;
150
+  res.s2=0x4040404040404040ULL;
151
+  return res;
152
+}
153
+batch static inline B_FFN_ALL_80(){
154
+  batch res;
155
+  res.s1=0x8080808080808080ULL;
156
+  res.s2=0x8080808080808080ULL;
157
+  return res;
158
+}
159
+batch static inline B_FFSH8L(batch a,int n){
160
+  batch res;
161
+  res.s1=a.s1<<n;
162
+  res.s2=a.s2<<n;
163
+  return res;
164
+}
165
+
166
+batch static inline B_FFSH8R(batch a,int n){
167
+  batch res;
168
+  res.s1=a.s1>>n;
169
+  res.s2=a.s2>>n;
170
+  return res;
171
+}
172
+
173
+
174
+void static inline M_EMPTY(void){
175
+}
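
The two-long variant above shifts whole 64-bit words, while the byte-array variants shift each byte separately. The two only agree when no set bit would cross a byte boundary. The sketch below is not part of the commit and uses hypothetical helper names (`shl_per_byte`, `shl_word`, `shifts_agree`). It assumes that callers mask the operand before shifting; the mask and shift pairs shown are illustrative, built from the constants defined in these headers, and are not confirmed by this diff.

```c
#include <stdint.h>

/* Reference behaviour: shift each of the 8 bytes left by n independently,
   discarding bits that leave the byte. */
static uint64_t shl_per_byte(uint64_t x, int n) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned char b = (unsigned char)(x >> (8 * i));
        b = (unsigned char)(b << n);
        r |= (uint64_t)b << (8 * i);
    }
    return r;
}

/* Whole-word shift, as a 64-bit integer implementation would do it. */
static uint64_t shl_word(uint64_t x, int n) {
    return x << n;
}

/* Returns 1 when both shifts give the same result for x restricted to mask. */
static int shifts_agree(uint64_t x, uint64_t mask, int n) {
    uint64_t m = x & mask;
    return shl_per_byte(m, n) == shl_word(m, n);
}
```

For example, a value masked with `0x29` in every byte can be shifted left by 1 safely, because the top bit of each byte is clear. A value masked with `0x80` in every byte cannot, because the word shift carries each set bit into the neighbouring byte.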

+ 201
- 0
FFdecsa/parallel_128_2mmx.h

@@ -0,0 +1,201 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+#include <mmintrin.h>
22
+
23
+#define MEMALIGN __attribute__((aligned(16)))
24
+
25
+struct group_t{
26
+  __m64 s1,s2;
27
+};
28
+typedef struct group_t group;
29
+
30
+#define GROUP_PARALLELISM 128
31
+
32
+group static inline FF0(){
33
+  group res;
34
+  res.s1=(__m64)0x0ULL;
35
+  res.s2=(__m64)0x0ULL;
36
+  return res;
37
+}
38
+
39
+group static inline FF1(){
40
+  group res;
41
+  res.s1=(__m64)0xffffffffffffffffULL;
42
+  res.s2=(__m64)0xffffffffffffffffULL;
43
+  return res;
44
+}
45
+
46
+group static inline FFAND(group a,group b){
47
+  group res;
48
+  res.s1=_m_pand(a.s1,b.s1);
49
+  res.s2=_m_pand(a.s2,b.s2);
50
+  return res;
51
+}
52
+
53
+group static inline FFOR(group a,group b){
54
+  group res;
55
+  res.s1=_m_por(a.s1,b.s1);
56
+  res.s2=_m_por(a.s2,b.s2);
57
+  return res;
58
+}
59
+
60
+group static inline FFXOR(group a,group b){
61
+  group res;
62
+  res.s1=_m_pxor(a.s1,b.s1);
63
+  res.s2=_m_pxor(a.s2,b.s2);
64
+  return res;
65
+}
66
+
67
+group static inline FFNOT(group a){
68
+  group res;
69
+  res.s1=_m_pxor(a.s1,FF1().s1);
70
+  res.s2=_m_pxor(a.s2,FF1().s2);
71
+  return res;
72
+}
73
+
74
+
75
+/* 64 rows of 128 bits */
76
+
77
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
78
+  *(((int *)tab)+2*g)=*((int *)data);
79
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
80
+}
81
+
82
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
83
+  *((int *)data)=*(((int *)tab)+2*g);
84
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
85
+}
86
+
87
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
88
+  int j;
89
+  for(j=0;j<n;j++){
90
+    *(data+j)^=*(tab+8*g+j);
91
+  }
92
+}
93
+
94
+
95
+struct batch_t{
96
+  __m64 s1,s2;
97
+};
98
+typedef struct batch_t batch;
99
+
100
+#define BYTES_PER_BATCH 16
101
+
102
+batch static inline B_FFAND(batch a,batch b){
103
+  batch res;
104
+  res.s1=_m_pand(a.s1,b.s1);
105
+  res.s2=_m_pand(a.s2,b.s2);
106
+  return res;
107
+}
108
+
109
+batch static inline B_FFOR(batch a,batch b){
110
+  batch res;
111
+  res.s1=_m_por(a.s1,b.s1);
112
+  res.s2=_m_por(a.s2,b.s2);
113
+  return res;
114
+}
115
+
116
+batch static inline B_FFXOR(batch a,batch b){
117
+  batch res;
118
+  res.s1=_m_pxor(a.s1,b.s1);
119
+  res.s2=_m_pxor(a.s2,b.s2);
120
+  return res;
121
+}
122
+
123
+batch static inline B_FFN_ALL_29(){
124
+  batch res;
125
+  res.s1=(__m64)0x2929292929292929ULL;
126
+  res.s2=(__m64)0x2929292929292929ULL;
127
+  return res;
128
+}
129
+batch static inline B_FFN_ALL_02(){
130
+  batch res;
131
+  res.s1=(__m64)0x0202020202020202ULL;
132
+  res.s2=(__m64)0x0202020202020202ULL;
133
+  return res;
134
+}
135
+batch static inline B_FFN_ALL_04(){
136
+  batch res;
137
+  res.s1=(__m64)0x0404040404040404ULL;
138
+  res.s2=(__m64)0x0404040404040404ULL;
139
+  return res;
140
+}
141
+batch static inline B_FFN_ALL_10(){
142
+  batch res;
143
+  res.s1=(__m64)0x1010101010101010ULL;
144
+  res.s2=(__m64)0x1010101010101010ULL;
145
+  return res;
146
+}
147
+batch static inline B_FFN_ALL_40(){
148
+  batch res;
149
+  res.s1=(__m64)0x4040404040404040ULL;
150
+  res.s2=(__m64)0x4040404040404040ULL;
151
+  return res;
152
+}
153
+batch static inline B_FFN_ALL_80(){
154
+  batch res;
155
+  res.s1=(__m64)0x8080808080808080ULL;
156
+  res.s2=(__m64)0x8080808080808080ULL;
157
+  return res;
158
+}
159
+
160
+batch static inline B_FFSH8L(batch a,int n){
161
+  batch res;
162
+  res.s1=_m_psllqi(a.s1,n);
163
+  res.s2=_m_psllqi(a.s2,n);
164
+  return res;
165
+}
166
+
167
+batch static inline B_FFSH8R(batch a,int n){
168
+  batch res;
169
+  res.s1=_m_psrlqi(a.s1,n);
170
+  res.s2=_m_psrlqi(a.s2,n);
171
+  return res;
172
+}
173
+
174
+void static inline M_EMPTY(void){
175
+  _m_empty();
176
+}
177
+
178
+
179
+#undef XOR_8_BY
180
+#define XOR_8_BY(d,s1,s2)    do{ __m64 *pd=(__m64 *)(d), *ps1=(__m64 *)(s1), *ps2=(__m64 *)(s2); \
181
+                                 *pd = _m_pxor( *ps1 , *ps2 ); }while(0)
182
+
183
+#undef XOREQ_8_BY
184
+#define XOREQ_8_BY(d,s)      do{ __m64 *pd=(__m64 *)(d), *ps=(__m64 *)(s); \
185
+                                 *pd = _m_pxor( *ps, *pd ); }while(0)
186
+
187
+#undef COPY_8_BY
188
+#define COPY_8_BY(d,s)       do{ __m64 *pd=(__m64 *)(d), *ps=(__m64 *)(s); \
189
+                                 *pd =  *ps; }while(0)
190
+
191
+#undef BEST_SPAN
192
+#define BEST_SPAN            8
193
+
194
+#undef XOR_BEST_BY
195
+#define XOR_BEST_BY(d,s1,s2) do{ XOR_8_BY(d,s1,s2); }while(0)
196
+
197
+#undef XOREQ_BEST_BY
198
+#define XOREQ_BEST_BY(d,s)   do{ XOREQ_8_BY(d,s); }while(0)
199
+
200
+#undef COPY_BEST_BY
201
+#define COPY_BEST_BY(d,s)    do{ COPY_8_BY(d,s); }while(0)
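
The `*_BEST_BY` wrappers above rely on the `do{ ... }while(0)` idiom. As general background, the idiom only behaves like a single statement when the macro body does not end in its own semicolon: the caller supplies it. The sketch below is not part of the commit; `BUMP` and `bump_if` are hypothetical names used to show the idiom in an unbraced `if`/`else`.

```c
/* Hypothetical statement-like macro. There is deliberately no semicolon after
   while (0); the call site provides it. */
#define BUMP(p) do { (*(p))++; } while (0)

/* Uses the macro in an unbraced if/else. If the macro definition ended in
   "while (0);", the call site would expand to two statements and the
   "else" would no longer attach to the "if", which is a compile error. */
static int bump_if(int flag) {
    int hits = 0, misses = 0;
    if (flag)
        BUMP(&hits);
    else
        BUMP(&misses);
    return hits - misses;
}
```

Here `bump_if(1)` returns 1 and `bump_if(0)` returns -1, which shows that each branch ran exactly one macro expansion.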

+ 207
- 0
FFdecsa/parallel_128_4int.h

@@ -0,0 +1,207 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+struct group_t{
22
+  unsigned int s1,s2,s3,s4;
23
+};
24
+typedef struct group_t group;
25
+
26
+#define GROUP_PARALLELISM 128
27
+
28
+group static inline FF0(){
29
+  group res;
30
+  res.s1=0x0;
31
+  res.s2=0x0;
32
+  res.s3=0x0;
33
+  res.s4=0x0;
34
+  return res;
35
+}
36
+
37
+group static inline FF1(){
38
+  group res;
39
+  res.s1=0xffffffff;
40
+  res.s2=0xffffffff;
41
+  res.s3=0xffffffff;
42
+  res.s4=0xffffffff;
43
+  return res;
44
+}
45
+
46
+group static inline FFAND(group a,group b){
47
+  group res;
48
+  res.s1=a.s1&b.s1;
49
+  res.s2=a.s2&b.s2;
50
+  res.s3=a.s3&b.s3;
51
+  res.s4=a.s4&b.s4;
52
+  return res;
53
+}
54
+
55
+group static inline FFOR(group a,group b){
56
+  group res;
57
+  res.s1=a.s1|b.s1;
58
+  res.s2=a.s2|b.s2;
59
+  res.s3=a.s3|b.s3;
60
+  res.s4=a.s4|b.s4;
61
+  return res;
62
+}
63
+
64
+group static inline FFXOR(group a,group b){
65
+  group res;
66
+  res.s1=a.s1^b.s1;
67
+  res.s2=a.s2^b.s2;
68
+  res.s3=a.s3^b.s3;
69
+  res.s4=a.s4^b.s4;
70
+  return res;
71
+}
72
+
73
+group static inline FFNOT(group a){
74
+  group res;
75
+  res.s1=~a.s1;
76
+  res.s2=~a.s2;
77
+  res.s3=~a.s3;
78
+  res.s4=~a.s4;
79
+  return res;
80
+}
81
+
82
+
83
+/* 64 rows of 128 bits */
84
+
85
+void static inline FFTABLEIN(unsigned char *tab, int g, unsigned char *data){
86
+  *(((int *)tab)+2*g)=*((int *)data);
87
+  *(((int *)tab)+2*g+1)=*(((int *)data)+1);
88
+}
89
+
90
+void static inline FFTABLEOUT(unsigned char *data, unsigned char *tab, int g){
91
+  *((int *)data)=*(((int *)tab)+2*g);
92
+  *(((int *)data)+1)=*(((int *)tab)+2*g+1);
93
+}
94
+
95
+void static inline FFTABLEOUTXORNBY(int n, unsigned char *data, unsigned char *tab, int g){
96
+  int j;
97
+  for(j=0;j<n;j++){
98
+    *(data+j)^=*(tab+8*g+j);
99
+  }
100
+}
101
+
102
+
103
+struct batch_t{
104
+  unsigned int s1,s2,s3,s4;
105
+};
106
+typedef struct batch_t batch;
107
+
108
+#define BYTES_PER_BATCH 16
109
+
110
+batch static inline B_FFAND(batch a,batch b){
111
+  batch res;
112
+  res.s1=a.s1&b.s1;
113
+  res.s2=a.s2&b.s2;
114
+  res.s3=a.s3&b.s3;
115
+  res.s4=a.s4&b.s4;
116
+  return res;
117
+}
118
+
119
+batch static inline B_FFOR(batch a,batch b){
120
+  batch res;
121
+  res.s1=a.s1|b.s1;
122
+  res.s2=a.s2|b.s2;
123
+  res.s3=a.s3|b.s3;
124
+  res.s4=a.s4|b.s4;
125
+  return res;
126
+}
127
+
128
+batch static inline B_FFXOR(batch a,batch b){
129
+  batch res;
130
+  res.s1=a.s1^b.s1;
131
+  res.s2=a.s2^b.s2;
132
+  res.s3=a.s3^b.s3;
133
+  res.s4=a.s4^b.s4;
134
+  return res;
135
+}
136
+
137
+
138
+batch static inline B_FFN_ALL_29(){
139
+  batch res;
140
+  res.s1=0x29292929;
141
+  res.s2=0x29292929;
142
+  res.s3=0x29292929;
143
+  res.s4=0x29292929;
144
+  return res;
145
+}
146
+batch static inline B_FFN_ALL_02(){
147
+  batch res;
148
+  res.s1=0x02020202;
149
+  res.s2=0x02020202;
150
+  res.s3=0x02020202;
151
+  res.s4=0x02020202;
152
+  return res;
153
+}
154
+batch static inline B_FFN_ALL_04(){
155
+  batch res;
156
+  res.s1=0x04040404;
157
+  res.s2=0x04040404;
158
+  res.s3=0x04040404;
159
+  res.s4=0x04040404;
160
+  return res;
161
+}
162
+batch static inline B_FFN_ALL_10(){
163
+  batch res;
164
+  res.s1=0x10101010;
165
+  res.s2=0x10101010;
166
+  res.s3=0x10101010;
167
+  res.s4=0x10101010;
168
+  return res;
169
+}
170
+batch static inline B_FFN_ALL_40(){
171
+  batch res;
172
+  res.s1=0x40404040;
173
+  res.s2=0x40404040;
174
+  res.s3=0x40404040;
175
+  res.s4=0x40404040;
176
+  return res;
177
+}
178
+batch static inline B_FFN_ALL_80(){
179
+  batch res;
180
+  res.s1=0x80808080;
181
+  res.s2=0x80808080;
182
+  res.s3=0x80808080;
183
+  res.s4=0x80808080;
184
+  return res;
185
+}
186
+
187
+batch static inline B_FFSH8L(batch a,int n){
188
+  batch res;
189
+  res.s1=a.s1<<n;
190
+  res.s2=a.s2<<n;
191
+  res.s3=a.s3<<n;
192
+  res.s4=a.s4<<n;
193
+  return res;
194
+}
195
+
196
+batch static inline B_FFSH8R(batch a,int n){
197
+  batch res;
198
+  res.s1=a.s1>>n;
199
+  res.s2=a.s2>>n;
200
+  res.s3=a.s3>>n;
201
+  res.s4=a.s4>>n;
202
+  return res;
203
+}
204
+
205
+
206
+void static inline M_EMPTY(void){
207
+}
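
For orientation, here is a minimal sketch of how the four-int `group` in this header acts as a 128-lane bit vector: every bitwise helper processes 128 independent one-bit lanes in a single call. The struct layout and the `FF1`/`FFXOR`/`FFNOT` bodies mirror the header; `group_lane` and `group_demo` are hypothetical helpers added only for illustration, and the lane numbering (lane 0 is bit 0 of `s1`) is an assumption, as is a 32-bit `unsigned int`.

```c
#include <assert.h>

/* Same layout as the header: 4 x 32 bits = 128 one-bit lanes. */
typedef struct { unsigned int s1, s2, s3, s4; } group;

static inline group FF1(void) {
  group r = { 0xffffffffU, 0xffffffffU, 0xffffffffU, 0xffffffffU };
  return r;
}

static inline group FFXOR(group a, group b) {
  group r = { a.s1 ^ b.s1, a.s2 ^ b.s2, a.s3 ^ b.s3, a.s4 ^ b.s4 };
  return r;
}

static inline group FFNOT(group a) {
  group r = { ~a.s1, ~a.s2, ~a.s3, ~a.s4 };
  return r;
}

/* Hypothetical helper: read lane n (0..127), assuming lane 0 is bit 0 of s1. */
static inline int group_lane(group g, int n) {
  const unsigned int w[4] = { g.s1, g.s2, g.s3, g.s4 };
  return (int)((w[n / 32] >> (n % 32)) & 1U);
}

/* Hypothetical demo: NOT equals XOR with all-ones, lane by lane. */
static void group_demo(void) {
  group a = { 0x00000001U, 0U, 0U, 0x80000000U };
  group n = FFNOT(a);
  group x = FFXOR(a, FF1());
  assert(n.s1 == x.s1 && n.s2 == x.s2 && n.s3 == x.s3 && n.s4 == x.s4);
  assert(group_lane(a, 0) == 1 && group_lane(a, 1) == 0);
  assert(group_lane(a, 127) == 1 && group_lane(n, 127) == 0);
}
```

The same identity (NOT as XOR with all-ones) is what the SSE variants of `FFNOT` rely on, since those instruction sets have no plain bitwise NOT.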

+ 95
- 0
FFdecsa/parallel_128_sse.h

@@ -0,0 +1,95 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2007 Dark Avenger
4
+ *               2003-2004  fatih89r
5
+ *
6
+ * This program is free software; you can redistribute it and/or modify
7
+ * it under the terms of the GNU General Public License as published by
8
+ * the Free Software Foundation; either version 2 of the License, or
9
+ * (at your option) any later version.
10
+ *
11
+ * This program is distributed in the hope that it will be useful,
12
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
14
+ * GNU General Public License for more details.
15
+ *
16
+ * You should have received a copy of the GNU General Public License
17
+ * along with this program; if not, write to the Free Software
18
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
19
+ */
20
+
21
+
22
+#include <xmmintrin.h>
23
+
24
+#define MEMALIGN __attribute__((aligned(16)))
25
+
26
+union __u128 {
27
+    unsigned int u[4];
28
+    __m128 v;
29
+};
30
+
31
+static const union __u128 ff0 = {{0x00000000U, 0x00000000U, 0x00000000U, 0x00000000U}};
32
+static const union __u128 ff1 = {{0xffffffffU, 0xffffffffU, 0xffffffffU, 0xffffffffU}};
33
+
34
+typedef __m128 group;
35
+#define GROUP_PARALLELISM 128
36
+#define FF0() ff0.v
37
+#define FF1() ff1.v
38
+#define FFAND(a,b) _mm_and_ps((a),(b))
39
+#define FFOR(a,b)  _mm_or_ps((a),(b))
40
+#define FFXOR(a,b) _mm_xor_ps((a),(b))
41
+#define FFNOT(a)   _mm_xor_ps((a),FF1())
42
+#define MALLOC(X)  _mm_malloc(X,16)
43
+#define FREE(X)    _mm_free(X)
44
+
45
+union __u64 {
46
+    unsigned int u[2];
47
+    __m64 v;
48
+};
49
+
50
+static const union __u64 ff29 = {{0x29292929U, 0x29292929U}};
51
+static const union __u64 ff02 = {{0x02020202U, 0x02020202U}};
52
+static const union __u64 ff04 = {{0x04040404U, 0x04040404U}};
53
+static const union __u64 ff10 = {{0x10101010U, 0x10101010U}};
54
+static const union __u64 ff40 = {{0x40404040U, 0x40404040U}};
55
+static const union __u64 ff80 = {{0x80808080U, 0x80808080U}};
56
+
57
+typedef __m64 batch;
58
+#define BYTES_PER_BATCH 8
59
+#define B_FFN_ALL_29() ff29.v
60
+#define B_FFN_ALL_02() ff02.v
61
+#define B_FFN_ALL_04() ff04.v
62
+#define B_FFN_ALL_10() ff10.v
63
+#define B_FFN_ALL_40() ff40.v
64
+#define B_FFN_ALL_80() ff80.v
65
+#define B_FFAND(a,b)  _mm_and_si64((a),(b))
66
+#define B_FFOR(a,b)   _mm_or_si64((a),(b))
67
+#define B_FFXOR(a,b)  _mm_xor_si64((a),(b))
68
+#define B_FFSH8L(a,n) _mm_slli_si64((a),(n))
69
+#define B_FFSH8R(a,n) _mm_srli_si64((a),(n))
70
+
71
+#define M_EMPTY()     _mm_empty()
72
+
73
+
74
+#undef XOR_8_BY
75
+#define XOR_8_BY(d,s1,s2)    do { *(__m64*)d = _mm_xor_si64(*(__m64*)(s1), *(__m64*)(s2)); } while(0)
76
+
77
+#undef XOREQ_8_BY
78
+#define XOREQ_8_BY(d,s)      XOR_8_BY(d, d, s)
79
+
80
+#undef COPY_8_BY
81
+#define COPY_8_BY(d,s)       do { *(__m64 *)(d) = *(__m64 *)(s); } while(0)
82
+
83
+#undef BEST_SPAN
84
+#define BEST_SPAN            16
85
+
86
+#undef XOR_BEST_BY
87
+static inline void XOR_BEST_BY(unsigned char *d, unsigned char *s1, unsigned char *s2)
88
+{
89
+    __m128 vs1 = _mm_load_ps((float*)s1);
90
+    __m128 vs2 = _mm_load_ps((float*)s2);
91
+    vs1 = _mm_xor_ps(vs1, vs2);
92
+    _mm_store_ps((float*)d, vs1);
93
+}
94
+
95
+#include "fftable.h"
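
A note on alignment: `_mm_load_ps` and `_mm_store_ps` require 16-byte-aligned addresses, which is why this header pairs `XOR_BEST_BY` with `MEMALIGN` and a `MALLOC` built on `_mm_malloc(X,16)`. The sketch below shows the same requirement in portable C11, without any intrinsics. `alloc16`, `is_aligned16` and `align_demo` are hypothetical names and are not part of FFdecsa.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero if p is suitable for a 16-byte aligned load. */
static int is_aligned16(const void *p) {
  return ((uintptr_t)p & 15U) == 0;
}

/* Hypothetical portable stand-in for MALLOC(X). C11 aligned_alloc expects
   the size to be a multiple of the alignment, so round it up first. */
static void *alloc16(size_t n) {
  size_t rounded = (n + 15U) & ~(size_t)15U;
  return aligned_alloc(16, rounded);
}

static void align_demo(void) {
  unsigned char *buf = alloc16(184);
  assert(buf != NULL);
  assert(is_aligned16(buf));      /* safe for an aligned 16-byte access */
  assert(!is_aligned16(buf + 8)); /* an 8-byte offset is not */
  free(buf);
}
```

Memory obtained this way is released with `free`, whereas memory from `_mm_malloc` must go through `_mm_free`, which is what the `FREE(X)` macro in the header does.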

+ 82
- 0
FFdecsa/parallel_128_sse2.h

@@ -0,0 +1,82 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2007 Dark Avenger
4
+ *               2003-2004  fatih89r
5
+ *
6
+ * This program is free software; you can redistribute it and/or modify
7
+ * it under the terms of the GNU General Public License as published by
8
+ * the Free Software Foundation; either version 2 of the License, or
9
+ * (at your option) any later version.
10
+ *
11
+ * This program is distributed in the hope that it will be useful,
12
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
13
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
14
+ * GNU General Public License for more details.
15
+ *
16
+ * You should have received a copy of the GNU General Public License
17
+ * along with this program; if not, write to the Free Software
18
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
19
+ */
20
+
21
+#include <emmintrin.h>
22
+
23
+#define MEMALIGN __attribute__((aligned(16)))
24
+
25
+union __u128i {
26
+	unsigned int u[4];
27
+	__m128i v;
28
+};
29
+
30
+static const union __u128i ff0 = {{0x00000000U, 0x00000000U, 0x00000000U, 0x00000000U}};
31
+static const union __u128i ff1 = {{0xffffffffU, 0xffffffffU, 0xffffffffU, 0xffffffffU}};
32
+
33
+typedef __m128i group;
34
+#define GROUP_PARALLELISM 128
35
+#define FF0() ff0.v
36
+#define FF1() ff1.v
37
+#define FFAND(a,b) _mm_and_si128((a),(b))
38
+#define FFOR(a,b)  _mm_or_si128((a),(b))
39
+#define FFXOR(a,b) _mm_xor_si128((a),(b))
40
+#define FFNOT(a)   _mm_xor_si128((a),FF1())
41
+#define MALLOC(X)  _mm_malloc(X,16)
42
+#define FREE(X)    _mm_free(X)
43
+
44
+/* BATCH */
45
+
46
+static const union __u128i ff29 = {{0x29292929U, 0x29292929U, 0x29292929U, 0x29292929U}};
47
+static const union __u128i ff02 = {{0x02020202U, 0x02020202U, 0x02020202U, 0x02020202U}};
48
+static const union __u128i ff04 = {{0x04040404U, 0x04040404U, 0x04040404U, 0x04040404U}};
49
+static const union __u128i ff10 = {{0x10101010U, 0x10101010U, 0x10101010U, 0x10101010U}};
50
+static const union __u128i ff40 = {{0x40404040U, 0x40404040U, 0x40404040U, 0x40404040U}};
51
+static const union __u128i ff80 = {{0x80808080U, 0x80808080U, 0x80808080U, 0x80808080U}};
52
+
53
+typedef __m128i batch;
54
+#define BYTES_PER_BATCH 16
55
+#define B_FFN_ALL_29() ff29.v
56
+#define B_FFN_ALL_02() ff02.v
57
+#define B_FFN_ALL_04() ff04.v
58
+#define B_FFN_ALL_10() ff10.v
59
+#define B_FFN_ALL_40() ff40.v
60
+#define B_FFN_ALL_80() ff80.v
61
+
62
+#define B_FFAND(a,b) FFAND(a,b)
63
+#define B_FFOR(a,b)  FFOR(a,b)
64
+#define B_FFXOR(a,b) FFXOR(a,b)
65
+#define B_FFSH8L(a,n) _mm_slli_epi64((a),(n))
66
+#define B_FFSH8R(a,n) _mm_srli_epi64((a),(n))
67
+
68
+#define M_EMPTY()
69
+
70
+#undef BEST_SPAN
71
+#define BEST_SPAN            16
72
+
73
+#undef XOR_BEST_BY
74
+static inline void XOR_BEST_BY(unsigned char *d, unsigned char *s1, unsigned char *s2)
75
+{
76
+	__m128i vs1 = _mm_load_si128((__m128i*)s1);
77
+	__m128i vs2 = _mm_load_si128((__m128i*)s2);
78
+	vs1 = _mm_xor_si128(vs1, vs2);
79
+	_mm_store_si128((__m128i*)d, vs1);
80
+}
81
+
82
+#include "fftable.h"

+ 102
- 0
FFdecsa/parallel_generic.h

@@ -0,0 +1,102 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+
22
+#if 0
23
+//// generics
24
+#define COPY4BY(d,s)     do{ int *pd=(int *)(d), *ps=(int *)(s); \
25
+                             *pd = *ps; }while(0)
26
+#define COPY8BY(d,s)     do{ long long int *pd=(long long int *)(d), *ps=(long long int *)(s); \
27
+                             *pd = *ps; }while(0)
28
+#define COPY16BY(d,s)    do{ long long int *pd=(long long int *)(d), *ps=(long long int *)(s); \
29
+                             *pd = *ps; \
30
+			     *(pd+1) = *(ps+1); }while(0)
31
+#define COPY32BY(d,s)    do{ long long int *pd=(long long int *)(d), *ps=(long long int *)(s); \
32
+                             *pd = *ps; \
33
+			     *(pd+1) = *(ps+1); \
34
+			     *(pd+2) = *(ps+2); \
35
+			     *(pd+3) = *(ps+3); }while(0)
36
+#define XOR4BY(d,s1,s2)  do{ int *pd=(int *)(d), *ps1=(int *)(s1), *ps2=(int *)(s2); \
37
+                             *pd = *ps1  ^ *ps2; }while(0)
38
+#define XOR8BY(d,s1,s2)  do{ long long int *pd=(long long int *)(d), *ps1=(long long int *)(s1), *ps2=(long long int *)(s2); \
39
+                             *pd = *ps1  ^ *ps2; }while(0)
40
+#define XOR16BY(d,s1,s2) do{ long long int *pd=(long long int *)(d), *ps1=(long long int *)(s1), *ps2=(long long int *)(s2); \
41
+                             *pd = *ps1  ^ *ps2; \
42
+                             *(pd+1) = *(ps1+1)  ^ *(ps2+1); }while(0)
43
+#define XOR32BY(d,s1,s2) do{ long long int *pd=(long long int *)(d), *ps1=(long long int *)(s1), *ps2=(long long int *)(s2); \
44
+                             *pd = *ps1  ^ *ps2; \
45
+                             *(pd+1) = *(ps1+1)  ^ *(ps2+1); \
46
+                             *(pd+2) = *(ps1+2)  ^ *(ps2+2); \
47
+                             *(pd+3) = *(ps1+3)  ^ *(ps2+3); }while(0)
48
+#define XOR32BV(d,s1,s2) do{ int *pd=(int *)(d); const int *ps1=(const int *)(s1), *ps2=(const int *)(s2); \
49
+                             int z; \
50
+			     for(z=0;z<8;z++){ \
51
+                               pd[z]=ps1[z]^ps2[z]; \
52
+			     } \
53
+                           }while(0)
54
+#define XOREQ4BY(d,s)    do{ int *pd=(int *)(d), *ps=(int *)(s); \
55
+                             *pd ^= *ps; }while(0)
56
+#define XOREQ8BY(d,s)    do{ long long int *pd=(long long int *)(d), *ps=(long long int *)(s); \
57
+                             *pd ^= *ps; }while(0)
58
+#define XOREQ16BY(d,s)   do{ long long int *pd=(long long int *)(d), *ps=(long long int *)(s); \
59
+                             *pd ^= *ps; \
60
+			     *(pd+1) ^=*(ps+1); }while(0)
61
+#define XOREQ32BY(d,s)   do{ long long int *pd=(long long int *)(d), *ps=(long long int *)(s); \
62
+                             *pd ^= *ps; \
63
+			     *(pd+1) ^=*(ps+1); \
64
+			     *(pd+2) ^=*(ps+2); \
65
+			     *(pd+3) ^=*(ps+3); }while(0)
66
+#define XOREQ32BY4(d,s)  do{ int *pd=(int *)(d), *ps=(int *)(s); \
67
+                             *pd ^= *ps; \
68
+			     *(pd+1) ^=*(ps+1); \
69
+			     *(pd+2) ^=*(ps+2); \
70
+			     *(pd+3) ^=*(ps+3); \
71
+			     *(pd+4) ^=*(ps+4); \
72
+			     *(pd+5) ^=*(ps+5); \
73
+			     *(pd+6) ^=*(ps+6); \
74
+			     *(pd+7) ^=*(ps+7); }while(0)
75
+#define XOREQ32BV(d,s)   do{ unsigned char *pd=(unsigned char *)(d), *ps=(unsigned char *)(s); \
76
+                             int z; \
77
+			     for(z=0;z<32;z++){ \
78
+                               pd[z]^=ps[z]; \
79
+			     } \
80
+                           }while(0)
81
+
82
+#else
83
+#define XOR_4_BY(d,s1,s2)    do{ int *pd=(int *)(d), *ps1=(int *)(s1), *ps2=(int *)(s2); \
84
+                               *pd = *ps1  ^ *ps2; }while(0)
85
+#define XOR_8_BY(d,s1,s2)    do{ long long int *pd=(long long int *)(d), *ps1=(long long int *)(s1), *ps2=(long long int *)(s2); \
86
+                               *pd = *ps1  ^ *ps2; }while(0)
87
+#define XOREQ_4_BY(d,s)      do{ int *pd=(int *)(d), *ps=(int *)(s); \
88
+                               *pd ^= *ps; }while(0)
89
+#define XOREQ_8_BY(d,s)      do{ long long int *pd=(long long int *)(d), *ps=(long long int *)(s); \
90
+                               *pd ^= *ps; }while(0)
91
+#define COPY_4_BY(d,s)       do{ int *pd=(int *)(d), *ps=(int *)(s); \
92
+                               *pd = *ps; }while(0)
93
+#define COPY_8_BY(d,s)       do{ long long int *pd=(long long int *)(d), *ps=(long long int *)(s); \
94
+                               *pd = *ps; }while(0)
95
+
96
+#define BEST_SPAN            8
97
+#define XOR_BEST_BY(d,s1,s2) do{ XOR_8_BY(d,s1,s2); }while(0)
98
+#define XOREQ_BEST_BY(d,s)   do{ XOREQ_8_BY(d,s); }while(0)
99
+#define COPY_BEST_BY(d,s)    do{ COPY_8_BY(d,s); }while(0)
100
+
101
+#define END_MM             do{ }while(0)
102
+#endif
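
The `XOR_8_BY` family in this header casts byte pointers to `long long int *`, which assumes the buffers are suitably aligned and sits outside the strict aliasing rules. As a hedged alternative, not what FFdecsa does, the same 8-byte XOR can be written with `memcpy`, which compilers commonly reduce to plain loads and stores. `xor8_bytes` and `xor8_demo` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical alternative to XOR_8_BY: d = s1 ^ s2 over 8 bytes,
   valid for any alignment of d, s1 and s2. */
static inline void xor8_bytes(unsigned char *d,
                              const unsigned char *s1,
                              const unsigned char *s2) {
  uint64_t a, b;
  memcpy(&a, s1, 8);
  memcpy(&b, s2, 8);
  a ^= b;
  memcpy(d, &a, 8);
}

static void xor8_demo(void) {
  unsigned char x[9] = { 0, 1, 2, 3, 4, 5, 6, 7, 8 };
  unsigned char y[9] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
  unsigned char out[8];
  int i;
  xor8_bytes(out, x + 1, y + 1); /* deliberately odd offsets */
  for (i = 0; i < 8; i++) {
    assert(out[i] == (unsigned char)(x[i + 1] ^ y[i + 1]));
  }
}
```

Because XOR is bytewise, the result does not depend on host byte order.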

+ 29
- 0
FFdecsa/parallel_std_def.h

@@ -0,0 +1,29 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+#define FFXOR(a,b) ((a)^(b))
21
+#define FFAND(a,b) ((a)&(b))
22
+#define FFOR(a,b)  ((a)|(b))
23
+#define FFNOT(a)   (~(a))
24
+
25
+#define B_FFAND(a,b) ((a)&(b))
26
+#define B_FFOR(a,b)  ((a)|(b))
27
+#define B_FFXOR(a,b) ((a)^(b))
28
+#define B_FFSH8L(a,n) ((a)<<(n))
29
+#define B_FFSH8R(a,n) ((a)>>(n))
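
`B_FFSH8L` and `B_FFSH8R` shift a whole word, not each byte separately, so a batch of packed bytes only behaves as independent bytes when the operand is masked first. The sketch below illustrates this with the `0x02` mask and a shift of 6; that particular pairing is chosen here for illustration and is an assumption, not taken from this header. `move_bit1_to_bit7` and its reference version are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define B_FFAND(a,b)  ((a)&(b))
#define B_FFSH8L(a,n) ((a)<<(n))

/* Move bit 1 of every packed byte to bit 7 of the same byte.
   The mask guarantees that no bit crosses a byte boundary. */
static inline uint32_t move_bit1_to_bit7(uint32_t x) {
  return B_FFSH8L(B_FFAND(x, 0x02020202U), 6);
}

/* Reference version: the same operation, one byte at a time. */
static inline uint32_t move_bit1_to_bit7_ref(uint32_t x) {
  uint32_t out = 0;
  int i;
  for (i = 0; i < 4; i++) {
    uint32_t byte  = (x >> (8 * i)) & 0xffU;
    uint32_t moved = ((byte & 0x02U) << 6) & 0xffU;
    out |= moved << (8 * i);
  }
  return out;
}

static void shift_demo(void) {
  assert(move_bit1_to_bit7(0x02020202U) == 0x80808080U);
  assert(move_bit1_to_bit7(0x12345678U) == move_bit1_to_bit7_ref(0x12345678U));
}
```

Without the mask, a left shift would carry the high bits of each byte into its neighbour, which is presumably why the headers provide the `B_FFN_ALL_xx` byte-replicated constants alongside the shifts.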

+ 906
- 0
FFdecsa/stream.c

@@ -0,0 +1,906 @@
1
+/* FFdecsa -- fast decsa algorithm
2
+ *
3
+ * Copyright (C) 2003-2004  fatih89r
4
+ *
5
+ * This program is free software; you can redistribute it and/or modify
6
+ * it under the terms of the GNU General Public License as published by
7
+ * the Free Software Foundation; either version 2 of the License, or
8
+ * (at your option) any later version.
9
+ *
10
+ * This program is distributed in the hope that it will be useful,
11
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
13
+ * GNU General Public License for more details.
14
+ *
15
+ * You should have received a copy of the GNU General Public License
16
+ * along with this program; if not, write to the Free Software
17
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
18
+ */
19
+
20
+
21
+
22
+// define statics only once, when STREAM_INIT is defined
23
+#ifdef STREAM_INIT
24
+struct stream_regs {
25
+  group A[32+10][4]; // 32 because we will move back (virtual shift register)
26
+  group B[32+10][4]; // 32 because we will move back (virtual shift register)
27
+  group X[4];
28
+  group Y[4];
29
+  group Z[4];
30
+  group D[4];
31
+  group E[4];
32
+  group F[4];
33
+  group p;
34
+  group q;
35
+  group r;
36
+  };
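
The `32+10` sizing of `A` and `B` is commented as a virtual shift register. A minimal sketch of that general idea follows, under the assumption that a 10-entry window slides toward lower indices and is copied back to the top of the array once it reaches index 0. The names `vsr`, `vsr_push` and `vsr_tap` are hypothetical, plain `unsigned int` stands in for `group`, and the actual handling in stream.c may differ.

```c
#include <assert.h>
#include <string.h>

#define TAPS  10   /* live register entries      */
#define SLACK 32   /* room to slide before a copy */

struct vsr {
  unsigned int buf[SLACK + TAPS];
  int pos;         /* index of the newest entry */
};

static void vsr_init(struct vsr *r) {
  memset(r->buf, 0, sizeof r->buf);
  r->pos = SLACK;
}

/* Shift in v: usually just a pointer move, with one block copy
   every SLACK pushes instead of TAPS element moves per push. */
static void vsr_push(struct vsr *r, unsigned int v) {
  if (r->pos == 0) {
    memmove(&r->buf[SLACK], &r->buf[0], TAPS * sizeof r->buf[0]);
    r->pos = SLACK;
  }
  r->pos--;
  r->buf[r->pos] = v;
}

/* Tap i: 0 is the newest entry, TAPS-1 the oldest one kept. */
static unsigned int vsr_tap(const struct vsr *r, int i) {
  return r->buf[r->pos + i];
}

static void vsr_demo(void) {
  struct vsr r;
  unsigned int v;
  int i;
  vsr_init(&r);
  for (v = 1; v <= 100; v++) vsr_push(&r, v);
  for (i = 0; i < TAPS; i++) assert(vsr_tap(&r, i) == 100U - (unsigned int)i);
}
```

The trade-off is extra memory (the slack entries) in exchange for avoiding a full register shift on every step.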
37
+
38
+static inline void trasp64_32_88ccw(unsigned char *data){
39
+/* 64 rows of 32 bits transposition (bytes transp. - 8x8 rotate counterclockwise)*/
40
+#define row ((unsigned int *)data)
41
+  int i,j;
42
+  for(j=0;j<64;j+=32){
43
+    unsigned int t,b;
44
+    for(i=0;i<16;i++){
45
+      t=row[j+i];
46
+      b=row[j+16+i];
47
+      row[j+i]   = (t&0x0000ffff)      | ((b           )<<16);
48
+      row[j+16+i]=((t           )>>16) |  (b&0xffff0000) ;
49
+    }
50
+  }
51
+  for(j=0;j<64;j+=16){
52
+    unsigned int t,b;
53
+    for(i=0;i<8;i++){
54
+      t=row[j+i];
55
+      b=row[j+8+i];
56
+      row[j+i]   = (t&0x00ff00ff)     | ((b&0x00ff00ff)<<8);
57
+      row[j+8+i] =((t&0xff00ff00)>>8) |  (b&0xff00ff00);
58
+    }
59
+  }
60
+  for(j=0;j<64;j+=8){
61
+    unsigned int t,b;
62
+    for(i=0;i<4;i++){
63
+      t=row[j+i];
64
+      b=row[j+4+i];
65
+      row[j+i]   =((t&0x0f0f0f0f)<<4) |  (b&0x0f0f0f0f);
66
+      row[j+4+i] = (t&0xf0f0f0f0)     | ((b&0xf0f0f0f0)>>4);
67
+    }
68
+  }
69
+  for(j=0;j<64;j+=4){
70
+    unsigned int t,b;
71
+    for(i=0;i<2;i++){
72
+      t=row[j+i];
73
+      b=row[j+2+i];
74
+      row[j+i]   =((t&0x33333333)<<2) |  (b&0x33333333);
75
+      row[j+2+i] = (t&0xcccccccc)     | ((b&0xcccccccc)>>2);
76
+    }
77
+  }
78
+  for(j=0;j<64;j+=2){
79
+    unsigned int t,b;
80
+    for(i=0;i<1;i++){
81
+      t=row[j+i];
82
+      b=row[j+1+i];
83
+      row[j+i]   =((t&0x55555555)<<1) |  (b&0x55555555);
84
+      row[j+1+i] = (t&0xaaaaaaaa)     | ((b&0xaaaaaaaa)>>1);
85
+    }
86
+  }
87
+#undef row
88
+}
89
+
90
+static inline void trasp64_32_88cw(unsigned char *data){
91
+/* 64 rows of 32 bits transposition (bytes transp. - 8x8 rotate clockwise)*/
92
+#define row ((unsigned int *)data)
93
+  int i,j;
94
+  for(j=0;j<64;j+=32){
95
+    unsigned int t,b;
96
+    for(i=0;i<16;i++){
97
+      t=row[j+i];
98
+      b=row[j+16+i];
99
+      row[j+i]   = (t&0x0000ffff)      | ((b           )<<16);
100
+      row[j+16+i]=((t           )>>16) |  (b&0xffff0000) ;
101
+    }
102
+  }
103
+  for(j=0;j<64;j+=16){
104
+    unsigned int t,b;
105
+    for(i=0;i<8;i++){
106
+      t=row[j+i];
107
+      b=row[j+8+i];
108
+      row[j+i]   = (t&0x00ff00ff)     | ((b&0x00ff00ff)<<8);
109
+      row[j+8+i] =((t&0xff00ff00)>>8) |  (b&0xff00ff00);
110
+    }
111
+  }
112
+  for(j=0;j<64;j+=8){
113
+    unsigned int t,b;
114
+    for(i=0;i<4;i++){
115
+      t=row[j+i];
116
+      b=row[j+4+i];
117
+      row[j+i]  =((t&0xf0f0f0f0)>>4) |   (b&0xf0f0f0f0);
118
+      row[j+4+i]= (t&0x0f0f0f0f)     |  ((b&0x0f0f0f0f)<<4);
119
+    }
120
+  }
121
+  for(j=0;j<64;j+=4){
122
+    unsigned int t,b;
123
+    for(i=0;i<2;i++){
124
+      t=row[j+i];
125
+      b=row[j+2+i];
126
+      row[j+i]  =((t&0xcccccccc)>>2) |  (b&0xcccccccc);
127
+      row[j+2+i]= (t&0x33333333)     | ((b&0x33333333)<<2);
128
+    }
129
+  }
130
+  for(j=0;j<64;j+=2){
131
+    unsigned int t,b;
132
+    for(i=0;i<1;i++){
133
+      t=row[j+i];
134
+      b=row[j+1+i];
135
+      row[j+i]  =((t&0xaaaaaaaa)>>1) |  (b&0xaaaaaaaa);
136
+      row[j+1+i]= (t&0x55555555)     | ((b&0x55555555)<<1);
137
+    }
138
+  }
139
+#undef row
140
+}
141
+
142
+//64-64----------------------------------------------------------
143
+static inline void trasp64_64_88ccw(unsigned char *data){
144
+/* 64 rows of 64 bits transposition (bytes transp. - 8x8 rotate counterclockwise)*/
145
+#define row ((unsigned long long int *)data)
146
+  int i,j;
147
+  for(j=0;j<64;j+=64){
148
+    unsigned long long int t,b;
149
+    for(i=0;i<32;i++){
150
+      t=row[j+i];
151
+      b=row[j+32+i];
152
+      row[j+i]   = (t&0x00000000ffffffffULL)      | ((b                      )<<32);
153
+      row[j+32+i]=((t                      )>>32) |  (b&0xffffffff00000000ULL) ;
154
+    }
155
+  }
156
+  for(j=0;j<64;j+=32){
157
+    unsigned long long int t,b;
158
+    for(i=0;i<16;i++){
159
+      t=row[j+i];
160
+      b=row[j+16+i];
161
+      row[j+i]   = (t&0x0000ffff0000ffffULL)      | ((b&0x0000ffff0000ffffULL)<<16);
162
+      row[j+16+i]=((t&0xffff0000ffff0000ULL)>>16) |  (b&0xffff0000ffff0000ULL) ;
163
+    }
164
+  }
165
+  for(j=0;j<64;j+=16){
166
+    unsigned long long int t,b;
167
+    for(i=0;i<8;i++){
168
+      t=row[j+i];
169
+      b=row[j+8+i];
170
+      row[j+i]   = (t&0x00ff00ff00ff00ffULL)     | ((b&0x00ff00ff00ff00ffULL)<<8);
171
+      row[j+8+i] =((t&0xff00ff00ff00ff00ULL)>>8) |  (b&0xff00ff00ff00ff00ULL);
172
+    }
173
+  }
174
+  for(j=0;j<64;j+=8){
175
+    unsigned long long int t,b;
176
+    for(i=0;i<4;i++){
177
+      t=row[j+i];
178
+      b=row[j+4+i];
179
+      row[j+i]   =((t&0x0f0f0f0f0f0f0f0fULL)<<4) |  (b&0x0f0f0f0f0f0f0f0fULL);
180
+      row[j+4+i] = (t&0xf0f0f0f0f0f0f0f0ULL)     | ((b&0xf0f0f0f0f0f0f0f0ULL)>>4);
181
+    }
182
+  }
183
+  for(j=0;j<64;j+=4){
184
+    unsigned long long int t,b;
185
+    for(i=0;i<2;i++){
186
+      t=row[j+i];
187
+      b=row[j+2+i];
188
+      row[j+i]   =((t&0x3333333333333333ULL)<<2) |  (b&0x3333333333333333ULL);
189
+      row[j+2+i] = (t&0xccccccccccccccccULL)     | ((b&0xccccccccccccccccULL)>>2);
190
+    }
191
+  }
192
+  for(j=0;j<64;j+=2){
193
+    unsigned long long int t,b;
194
+    for(i=0;i<1;i++){
195
+      t=row[j+i];
196
+      b=row[j+1+i];
197
+      row[j+i]   =((t&0x5555555555555555ULL)<<1) |  (b&0x5555555555555555ULL);
198
+      row[j+1+i] = (t&0xaaaaaaaaaaaaaaaaULL)     | ((b&0xaaaaaaaaaaaaaaaaULL)>>1);
199
+    }
200
+  }
201
+#undef row
202
+}
203
+
204
+static inline void trasp64_64_88cw(unsigned char *data){
205
+/* 64 rows of 64 bits transposition (bytes transp. - 8x8 rotate clockwise)*/
206
+#define row ((unsigned long long int *)data)
207
+  int i,j;
208
+  for(j=0;j<64;j+=64){
209
+    unsigned long long int t,b;
210
+    for(i=0;i<32;i++){
211
+      t=row[j+i];
212
+      b=row[j+32+i];
213
+      row[j+i]   = (t&0x00000000ffffffffULL)      | ((b                      )<<32);
214
+      row[j+32+i]=((t                      )>>32) |  (b&0xffffffff00000000ULL) ;
215
+    }
216
+  }
217
+  for(j=0;j<64;j+=32){
218
+    unsigned long long int t,b;
219
+    for(i=0;i<16;i++){
220
+      t=row[j+i];
221
+      b=row[j+16+i];
222
+      row[j+i]   = (t&0x0000ffff0000ffffULL)      | ((b&0x0000ffff0000ffffULL)<<16);
223
+      row[j+16+i]=((t&0xffff0000ffff0000ULL)>>16) |  (b&0xffff0000ffff0000ULL) ;
224
+    }
225
+  }
226
+  for(j=0;j<64;j+=16){
227
+    unsigned long long int t,b;
228
+    for(i=0;i<8;i++){
229
+      t=row[j+i];
230
+      b=row[j+8+i];
231
+      row[j+i]   = (t&0x00ff00ff00ff00ffULL)     | ((b&0x00ff00ff00ff00ffULL)<<8);
232
+      row[j+8+i] =((t&0xff00ff00ff00ff00ULL)>>8) |  (b&0xff00ff00ff00ff00ULL);
233
+    }
234
+  }
235
+  for(j=0;j<64;j+=8){
236
+    unsigned long long int t,b;
237
+    for(i=0;i<4;i++){
238
+      t=row[j+i];
239
+      b=row[j+4+i];
240
+      row[j+i]   =((t&0xf0f0f0f0f0f0f0f0ULL)>>4) |   (b&0xf0f0f0f0f0f0f0f0ULL);
241
+      row[j+4+i] = (t&0x0f0f0f0f0f0f0f0fULL)     |  ((b&0x0f0f0f0f0f0f0f0fULL)<<4);
242
+    }
243
+  }
244
+  for(j=0;j<64;j+=4){
245
+    unsigned long long int t,b;
246
+    for(i=0;i<2;i++){
247
+      t=row[j+i];
248
+      b=row[j+2+i];
249
+      row[j+i]   =((t&0xccccccccccccccccULL)>>2) |  (b&0xccccccccccccccccULL);
250
+      row[j+2+i] = (t&0x3333333333333333ULL)     | ((b&0x3333333333333333ULL)<<2);
251
+    }
252
+  }
253
+  for(j=0;j<64;j+=2){
254
+    unsigned long long int t,b;
255
+    for(i=0;i<1;i++){
256
+      t=row[j+i];
257
+      b=row[j+1+i];
258
+      row[j+i]   =((t&0xaaaaaaaaaaaaaaaaULL)>>1) |  (b&0xaaaaaaaaaaaaaaaaULL);
259
+      row[j+1+i] = (t&0x5555555555555555ULL)     | ((b&0x5555555555555555ULL)<<1);
260
+    }
261
+  }
262
+#undef row
263
+}
264
+
265
+//64-128----------------------------------------------------------
266
+static inline void trasp64_128_88ccw(unsigned char *data){
267
+/* 64 rows of 128 bits transposition (bytes transp. - 8x8 rotate counterclockwise)*/
268
+#define halfrow ((unsigned long long int *)data)
269
+  int i,j;
270
+  for(j=0;j<64;j+=64){
271
+    unsigned long long int t,b;
272
+    for(i=0;i<32;i++){
273
+      t=halfrow[2*(j+i)];
274
+      b=halfrow[2*(j+32+i)];
275
+      halfrow[2*(j+i)]   = (t&0x00000000ffffffffULL)      | ((b                      )<<32);
276
+      halfrow[2*(j+32+i)]=((t                      )>>32) |  (b&0xffffffff00000000ULL) ;
277
+      t=halfrow[2*(j+i)+1];
278
+      b=halfrow[2*(j+32+i)+1];
279
+      halfrow[2*(j+i)+1]   = (t&0x00000000ffffffffULL)      | ((b                      )<<32);
280
+      halfrow[2*(j+32+i)+1]=((t                      )>>32) |  (b&0xffffffff00000000ULL) ;
281
+    }
282
+  }
283
+  for(j=0;j<64;j+=32){
284
+    unsigned long long int t,b;
285
+    for(i=0;i<16;i++){
286
+      t=halfrow[2*(j+i)];
287
+      b=halfrow[2*(j+16+i)];
288
+      halfrow[2*(j+i)]   = (t&0x0000ffff0000ffffULL)      | ((b&0x0000ffff0000ffffULL)<<16);
289
+      halfrow[2*(j+16+i)]=((t&0xffff0000ffff0000ULL)>>16) |  (b&0xffff0000ffff0000ULL) ;
290
+      t=halfrow[2*(j+i)+1];
291
+      b=halfrow[2*(j+16+i)+1];
292
+      halfrow[2*(j+i)+1]   = (t&0x0000ffff0000ffffULL)      | ((b&0x0000ffff0000ffffULL)<<16);
293
+      halfrow[2*(j+16+i)+1]=((t&0xffff0000ffff0000ULL)>>16) |  (b&0xffff0000ffff0000ULL) ;
294
+    }
295
+  }
296
+  for(j=0;j<64;j+=16){
297
+    unsigned long long int t,b;
298
+    for(i=0;i<8;i++){
299
+      t=halfrow[2*(j+i)];
300
+      b=halfrow[2*(j+8+i)];
301
+      halfrow[2*(j+i)]   = (t&0x00ff00ff00ff00ffULL)     | ((b&0x00ff00ff00ff00ffULL)<<8);
302
+      halfrow[2*(j+8+i)] =((t&0xff00ff00ff00ff00ULL)>>8) |  (b&0xff00ff00ff00ff00ULL);
303
+      t=halfrow[2*(j+i)+1];
304
+      b=halfrow[2*(j+8+i)+1];
305
+      halfrow[2*(j+i)+1]   = (t&0x00ff00ff00ff00ffULL)     | ((b&0x00ff00ff00ff00ffULL)<<8);
306
+      halfrow[2*(j+8+i)+1] =((t&0xff00ff00ff00ff00ULL)>>8) |  (b&0xff00ff00ff00ff00ULL);
307
+    }
308
+  }
309
+  for(j=0;j<64;j+=8){
310
+    unsigned long long int t,b;
311
+    for(i=0;i<4;i++){
312
+      t=halfrow[2*(j+i)];
313
+      b=halfrow[2*(j+4+i)];
314
+      halfrow[2*(j+i)]   =((t&0x0f0f0f0f0f0f0f0fULL)<<4) |  (b&0x0f0f0f0f0f0f0f0fULL);
315
+      halfrow[2*(j+4+i)] = (t&0xf0f0f0f0f0f0f0f0ULL)     | ((b&0xf0f0f0f0f0f0f0f0ULL)>>4);
316
+      t=halfrow[2*(j+i)+1];
317
+      b=halfrow[2*(j+4+i)+1];
318
+      halfrow[2*(j+i)+1]   =((t&0x0f0f0f0f0f0f0f0fULL)<<4) |  (b&0x0f0f0f0f0f0f0f0fULL);
319
+      halfrow[2*(j+4+i)+1] = (t&0xf0f0f0f0f0f0f0f0ULL)     | ((b&0xf0f0f0f0f0f0f0f0ULL)>>4);
320
+    }
321
+  }
322
+  for(j=0;j<64;j+=4){
323
+    unsigned long long int t,b;
324
+    for(i=0;i<2;i++){
325
+      t=halfrow[2*(j+i)];
326
+      b=halfrow[2*(j+2+i)];
327
+      halfrow[2*(j+i)]   =((t&0x3333333333333333ULL)<<2) |  (b&0x3333333333333333ULL);
328
+      halfrow[2*(j+2+i)] = (t&0xccccccccccccccccULL)     | ((b&0xccccccccccccccccULL)>>2);
329
+      t=halfrow[2*(j+i)+1];
330
+      b=halfrow[2*(j+2+i)+1];
331
+      halfrow[2*(j+i)+1]   =((t&0x3333333333333333ULL)<<2) |  (b&0x3333333333333333ULL);
332
+      halfrow[2*(j+2+i)+1] = (t&0xccccccccccccccccULL)     | ((b&0xccccccccccccccccULL)>>2);
333
+    }
334
+  }
335
+  for(j=0;j<64;j+=2){
336
+    unsigned long long int t,b;
337
+    for(i=0;i<1;i++){
338
+      t=halfrow[2*(j+i)];
339
+      b=halfrow[2*(j+1+i)];
340
+      halfrow[2*(j+i)]   =((t&0x5555555555555555ULL)<<1) |  (b&0x5555555555555555ULL);
341
+      halfrow[2*(j+1+i)] = (t&0xaaaaaaaaaaaaaaaaULL)     | ((b&0xaaaaaaaaaaaaaaaaULL)>>1);
342
+      t=halfrow[2*(j+i)+1];
343
+      b=halfrow[2*(j+1+i)+1];
344
+      halfrow[2*(j+i)+1]   =((t&0x5555555555555555ULL)<<1) |  (b&0x5555555555555555ULL);
345
+      halfrow[2*(j+1+i)+1] = (t&0xaaaaaaaaaaaaaaaaULL)     | ((b&0xaaaaaaaaaaaaaaaaULL)>>1);
346
+    }
347
+  }
348
+#undef halfrow
349
+}
350
+
351
+static inline void trasp64_128_88cw(unsigned char *data){
352
+/* 64 rows of 128 bits transposition (bytes transp. - 8x8 rotate clockwise)*/
353
+#define halfrow ((unsigned long long int *)data)
354
+  int i,j;
355
+  for(j=0;j<64;j+=64){
356
+    unsigned long long int t,b;
357
+    for(i=0;i<32;i++){
358
+      t=halfrow[2*(j+i)];
359
+      b=halfrow[2*(j+32+i)];
360
+      halfrow[2*(j+i)]   = (t&0x00000000ffffffffULL)      | ((b                      )<<32);
361
+      halfrow[2*(j+32+i)]=((t                      )>>32) |  (b&0xffffffff00000000ULL) ;
362
+      t=halfrow[2*(j+i)+1];
363
+      b=halfrow[2*(j+32+i)+1];
364
+      halfrow[2*(j+i)+1]   = (t&0x00000000ffffffffULL)      | ((b                      )<<32);
365
+      halfrow[2*(j+32+i)+1]=((t                      )>>32) |  (b&0xffffffff00000000ULL) ;
366
+    }
367
+  }
368
+  for(j=0;j<64;j+=32){
369
+    unsigned long long int t,b;
370
+    for(i=0;i<16;i++){
371
+      t=halfrow[2*(j+i)];
372
+      b=halfrow[2*(j+16+i)];
373
+      halfrow[2*(j+i)]   = (t&0x0000ffff0000ffffULL)      | ((b&0x0000ffff0000ffffULL)<<16);
374
+      halfrow[2*(j+16+i)]=((t&0xffff0000ffff0000ULL)>>16) |  (b&0xffff0000ffff0000ULL) ;
375
+      t=halfrow[2*(j+i)+1];
376
+      b=halfrow[2*(j+16+i)+1];
377
+      halfrow[2*(j+i)+1]   = (t&0x0000ffff0000ffffULL)      | ((b&0x0000ffff0000ffffULL)<<16);
378
+      halfrow[2*(j+16+i)+1]=((t&0xffff0000ffff0000ULL)>>16) |  (b&0xffff0000ffff0000ULL) ;
379
+    }
380
+  }
381
+  for(j=0;j<64;j+=16){
382
+    unsigned long long int t,b;
383
+    for(i=0;i<8;i++){
384
+      t=halfrow[2*(j+i)];
385
+      b=halfrow[2*(j+8+i)];
386
+      halfrow[2*(j+i)]   = (t&0x00ff00ff00ff00ffULL)     | ((b&0x00ff00ff00ff00ffULL)<<8);
387
+      halfrow[2*(j+8+i)] =((t&0xff00ff00ff00ff00ULL)>>8) |  (b&0xff00ff00ff00ff00ULL);
388
+      t=halfrow[2*(j+i)+1];
389
+      b=halfrow[2*(j+8+i)+1];
390
+      halfrow[2*(j+i)+1]   = (t&0x00ff00ff00ff00ffULL)     | ((b&0x00ff00ff00ff00ffULL)<<8);
391
+      halfrow[2*(j+8+i)+1] =((t&0xff00ff00ff00ff00ULL)>>8) |  (b&0xff00ff00ff00ff00ULL);
392
+    }
393
+  }
394
+  for(j=0;j<64;j+=8){
395
+    unsigned long long int t,b;
396
+    for(i=0;i<4;i++){
397
+      t=halfrow[2*(j+i)];
398
+      b=halfrow[2*(j+4+i)];
399
+      halfrow[2*(j+i)]   =((t&0xf0f0f0f0f0f0f0f0ULL)>>4) |   (b&0xf0f0f0f0f0f0f0f0ULL);
400
+      halfrow[2*(j+4+i)] = (t&0x0f0f0f0f0f0f0f0fULL)     |  ((b&0x0f0f0f0f0f0f0f0fULL)<<4);
401
+      t=halfrow[2*(j+i)+1];
402
+      b=halfrow[2*(j+4+i)+1];
403
+      halfrow[2*(j+i)+1]   =((t&0xf0f0f0f0f0f0f0f0ULL)>>4) |   (b&0xf0f0f0f0f0f0f0f0ULL);
404
+      halfrow[2*(j+4+i)+1] = (t&0x0f0f0f0f0f0f0f0fULL)     |  ((b&0x0f0f0f0f0f0f0f0fULL)<<4);
405
+    }
406
+  }
407
+  for(j=0;j<64;j+=4){
408
+    unsigned long long int t,b;
409
+    for(i=0;i<2;i++){
410
+      t=halfrow[2*(j+i)];
411
+      b=halfrow[2*(j+2+i)];
412
+      halfrow[2*(j+i)]   =((t&0xccccccccccccccccULL)>>2) |  (b&0xccccccccccccccccULL);
413
+      halfrow[2*(j+2+i)] = (t&0x3333333333333333ULL)     | ((b&0x3333333333333333ULL)<<2);
414
+      t=halfrow[2*(j+i)+1];
415
+      b=halfrow[2*(j+2+i)+1];
416
+      halfrow[2*(j+i)+1]   =((t&0xccccccccccccccccULL)>>2) |  (b&0xccccccccccccccccULL);
417
+      halfrow[2*(j+2+i)+1] = (t&0x3333333333333333ULL)     | ((b&0x3333333333333333ULL)<<2);
418
+    }
419
+  }
420
+  for(j=0;j<64;j+=2){
421
+    unsigned long long int t,b;
422
+    for(i=0;i<1;i++){
423
+      t=halfrow[2*(j+i)];
424
+      b=halfrow[2*(j+1+i)];
425
+      halfrow[2*(j+i)]   =((t&0xaaaaaaaaaaaaaaaaULL)>>1) |  (b&0xaaaaaaaaaaaaaaaaULL);
426
+      halfrow[2*(j+1+i)] = (t&0x5555555555555555ULL)     | ((b&0x5555555555555555ULL)<<1);
427
+      t=halfrow[2*(j+i)+1];
428
+      b=halfrow[2*(j+1+i)+1];
429
+      halfrow[2*(j+i)+1]   =((t&0xaaaaaaaaaaaaaaaaULL)>>1) |  (b&0xaaaaaaaaaaaaaaaaULL);
430
+      halfrow[2*(j+1+i)+1] = (t&0x5555555555555555ULL)     | ((b&0x5555555555555555ULL)<<1);
431
+    }
432
+  }
433
+#undef halfrow
434
+}
435
+#endif
436
+
437
+
438
+#ifdef STREAM_INIT
439
+void stream_cypher_group_init(
440
+  struct stream_regs *regs,
441
+  group         iA[8][4], // [In]  iA00,iA01,...iA73 32 groups  | Derived from key.
442
+  group         iB[8][4], // [In]  iB00,iB01,...iB73 32 groups  | Derived from key.
443
+  unsigned char *sb)      // [In]  (SB0,SB1,...SB7)...x32 32*8 bytes | Extra input.
444
+#endif
445
+#ifdef STREAM_NORMAL
446
+void stream_cypher_group_normal(
447
+  struct stream_regs *regs,
448
+  unsigned char *cb)    // [Out] (CB0,CB1,...CB7)...x32 32*8 bytes | Output.
449
+#endif
450
+{
451
+#ifdef STREAM_INIT
452
+  group in1[4];
453
+  group in2[4];
454
+#endif
455
+  group extra_B[4];
456
+  group fa,fb,fc,fd,fe;
457
+  group s1a,s1b,s2a,s2b,s3a,s3b,s4a,s4b,s5a,s5b,s6a,s6b,s7a,s7b;
458
+  group next_E[4];
459
+  group tmp0,tmp1,tmp2,tmp3,tmp4;
460
+#ifdef STREAM_INIT
461
+  group *sb_g=(group *)sb;
462
+#endif
463
+#ifdef STREAM_NORMAL
464
+  group *cb_g=(group *)cb;
465
+#endif
466
+  int aboff;
467
+  int i,j,k,b;
468
+  int dbg;
469
+
470
+#ifdef STREAM_INIT
471
+  DBG(fprintf(stderr,":::::::::: BEGIN STREAM INIT\n"));
472
+#endif
473
+#ifdef STREAM_NORMAL
474
+  DBG(fprintf(stderr,":::::::::: BEGIN STREAM NORMAL\n"));
475
+#endif
476
+#ifdef STREAM_INIT
477
+for(j=0;j<64;j++){
478
+  DBG(fprintf(stderr,"precall prerot stream_in[%2i]=",j));
479
+  DBG(dump_mem("",sb+BYPG*j,BYPG,BYPG));
480
+}
481
+
482
+DBG(dump_mem("stream_prerot ",sb,GROUP_PARALLELISM*8,BYPG));
483
+#if GROUP_PARALLELISM==32
484
+trasp64_32_88ccw(sb);
485
+#endif
486
+#if GROUP_PARALLELISM==64
487
+trasp64_64_88ccw(sb);
488
+#endif
489
+#if GROUP_PARALLELISM==128
490
+trasp64_128_88ccw(sb);
491
+#endif
492
+DBG(dump_mem("stream_postrot",sb,GROUP_PARALLELISM*8,BYPG));
493
+
494
+for(j=0;j<64;j++){
495
+  DBG(fprintf(stderr,"precall stream_in[%2i]=",j));
496
+  DBG(dump_mem("",sb+BYPG*j,BYPG,BYPG));
497
+}
498
+#endif
499
+
500
+  aboff=32;
501
+
502
+#ifdef STREAM_INIT
503
+  // load first 32 bits of ck into A[aboff+0]..A[aboff+7]
504
+  // load last  32 bits of ck into B[aboff+0]..B[aboff+7]
505
+  // all other regs = 0
506
+  for(i=0;i<8;i++){
507
+    for(b=0;b<4;b++){
508
+DBG(fprintf(stderr,"dbg from iA A[%i][%i]=",i,b));
509
+DBG(dump_mem("",(unsigned char *)&iA[i][b],BYPG,BYPG));
510
+DBG(fprintf(stderr,"                                       dbg from iB B[%i][%i]=",i,b));
511
+DBG(dump_mem("",(unsigned char *)&iB[i][b],BYPG,BYPG));
512
+      regs->A[aboff+i][b]=iA[i][b];
513
+      regs->B[aboff+i][b]=iB[i][b];
514
+    }
515
+  }
516
+  for(b=0;b<4;b++){
517
+    regs->A[aboff+8][b]=FF0();
518
+    regs->A[aboff+9][b]=FF0();
519
+    regs->B[aboff+8][b]=FF0();
520
+    regs->B[aboff+9][b]=FF0();
521
+  }
522
+  for(b=0;b<4;b++){
523
+    regs->X[b]=FF0();
524
+    regs->Y[b]=FF0();
525
+    regs->Z[b]=FF0();
526
+    regs->D[b]=FF0();
527
+    regs->E[b]=FF0();
528
+    regs->F[b]=FF0();
529
+  }
530
+  regs->p=FF0();
531
+  regs->q=FF0();
532
+  regs->r=FF0();
533
+#endif
534
+
535
+for(dbg=0;dbg<4;dbg++){
536
+  DBG(fprintf(stderr,"dbg A0[%i]=",dbg));
537
+  DBG(dump_mem("",(unsigned char *)&regs->A[aboff+0][dbg],BYPG,BYPG));
538
+  DBG(fprintf(stderr,"dbg B0[%i]=",dbg));
539
+  DBG(dump_mem("",(unsigned char *)&regs->B[aboff+0][dbg],BYPG,BYPG));
540
+}
541
+
542
+////////////////////////////////////////////////////////////////////////////////
543
+
544
+  // EXTERNAL LOOP - 8 bytes per operation
545
+  for(i=0;i<8;i++){
546
+
547
+    DBG(fprintf(stderr,"--BEGIN EXTERNAL LOOP %i\n",i));
548
+
549
+#ifdef STREAM_INIT
550
+    for(b=0;b<4;b++){
551
+      in1[b]=sb_g[8*i+4+b];
552
+      in2[b]=sb_g[8*i+b];
553
+    }
554
+#endif
555
+
556
+    // INTERNAL LOOP - 2 bits per iteration
557
+    for(j=0; j<4; j++){
558
+
559
+      DBG(fprintf(stderr,"---BEGIN INTERNAL LOOP %i (EXT %i, INT %i)\n",j,i,j));
560
+
561
+      // from A0..A9, 35 bits are selected as inputs to 7 s-boxes
562
+      // 5 bits input per s-box, 2 bits output per s-box
563
+
564
+      // we can select bits without any masking or shifting operations
565
+      // and synthesize s-boxes with optimized boolean functions.
566
+      // this is the actual reason we do all the crazy transposition
567
+      // stuff to switch between normal and bit slice representations.
568
+      // this code really flies.
569
+
570
+      fe=regs->A[aboff+3][0];fa=regs->A[aboff+0][2];fb=regs->A[aboff+5][1];fc=regs->A[aboff+6][3];fd=regs->A[aboff+8][0];
571
+/* 1000 1110  1110 0001   : lev  7: */ //tmp0=( fa^( fb^( ( ( ( fa|fb )^fc )|( fc^fd ) )^ALL_ONES ) ) );
572
+/* 1110 0010  0011 0011   : lev  6: */ //tmp1=( ( fa|fb )^( ( fc&( fa|( fb^fd ) ) )^ALL_ONES ) );
573
+/* 0011 0110  1000 1101   : lev  5: */ //tmp2=( fa^( ( fb&fd )^( ( fa&fd )|fc ) ) );
574
+/* 0101 0101  1001 0011   : lev  5: */ //tmp3=( ( fa&fc )^( fa^( ( fa&fb )|fd ) ) );
575
+/* 1000 1110  1110 0001   : lev  7: */ tmp0=FFXOR(fa,FFXOR(fb,FFXOR(FFOR(FFXOR(FFOR(fa,fb),fc),FFXOR(fc,fd)),FF1())));
576
+/* 1110 0010  0011 0011   : lev  6: */ tmp1=FFXOR(FFOR(fa,fb),FFXOR(FFAND(fc,FFOR(fa,FFXOR(fb,fd))),FF1()));
577
+/* 0011 0110  1000 1101   : lev  5: */ tmp2=FFXOR(fa,FFXOR(FFAND(fb,fd),FFOR(FFAND(fa,fd),fc)));
578
+/* 0101 0101  1001 0011   : lev  5: */ tmp3=FFXOR(FFAND(fa,fc),FFXOR(fa,FFOR(FFAND(fa,fb),fd)));
579
+      s1a=FFXOR(tmp0,FFAND(fe,tmp1));
580
+      s1b=FFXOR(tmp2,FFAND(fe,tmp3));
581
+//dump_mem("s1as1b-fe",&fe,BYPG,BYPG);
582
+//dump_mem("s1as1b-fa",&fa,BYPG,BYPG);
583
+//dump_mem("s1as1b-fb",&fb,BYPG,BYPG);
584
+//dump_mem("s1as1b-fc",&fc,BYPG,BYPG);
585
+//dump_mem("s1as1b-fd",&fd,BYPG,BYPG);
586
+
587
+      fe=regs->A[aboff+1][1];fa=regs->A[aboff+2][2];fb=regs->A[aboff+5][3];fc=regs->A[aboff+6][0];fd=regs->A[aboff+8][1];
588
+/* 1001 1110  0110 0001   : lev  6: */ //tmp0=( fa^( ( fb&( fc|fd ) )^( fc^( fd^ALL_ONES ) ) ) );
589
+/* 0000 0011  0111 1011   : lev  5: */ //tmp1=( ( fa&( fb^fd ) )|( ( fa|fb )&fc ) );
590
+/* 1100 0110  1101 0010   : lev  6: */ //tmp2=( ( fb&fd )^( ( fa&fd )|( fb^( fc^ALL_ONES ) ) ) );
591
+/* 0001 1110  1111 0101   : lev  5: */ //tmp3=( ( fa&fd )|( fa^( fb^( fc&fd ) ) ) );
592
+/* 1001 1110  0110 0001   : lev  6: */ tmp0=FFXOR(fa,FFXOR(FFAND(fb,FFOR(fc,fd)),FFXOR(fc,FFXOR(fd,FF1()))));
593
+/* 0000 0011  0111 1011   : lev  5: */ tmp1=FFOR(FFAND(fa,FFXOR(fb,fd)),FFAND(FFOR(fa,fb),fc));
594
+/* 1100 0110  1101 0010   : lev  6: */ tmp2=FFXOR(FFAND(fb,fd),FFOR(FFAND(fa,fd),FFXOR(fb,FFXOR(fc,FF1()))));
595
+/* 0001 1110  1111 0101   : lev  5: */ tmp3=FFOR(FFAND(fa,fd),FFXOR(fa,FFXOR(fb,FFAND(fc,fd))));
596
+      s2a=FFXOR(tmp0,FFAND(fe,tmp1));
597
+      s2b=FFXOR(tmp2,FFAND(fe,tmp3));
598
+
599
+      fe=regs->A[aboff+0][3];fa=regs->A[aboff+1][0];fb=regs->A[aboff+4][1];fc=regs->A[aboff+4][3];fd=regs->A[aboff+5][2];
600
+/* 0100 1011  1001 0110   : lev  5: */ //tmp0=( fa^( fb^( ( fc&( fa|fd ) )^fd ) ) );
601
+/* 1101 0101  1000 1100   : lev  7: */ //tmp1=( ( fa&fc )^( ( fa^fd )|( ( fb|fc )^( fd^ALL_ONES ) ) ) );
602
+/* 0010 0111  1101 1000   : lev  4: */ //tmp2=( fa^( ( ( fb^fc )&fd )^fc ) );
603
+/* 1111 1111  1111 1111   : lev  0: */ //tmp3=ALL_ONES;
604
+/* 0100 1011  1001 0110   : lev  5: */ tmp0=FFXOR(fa,FFXOR(fb,FFXOR(FFAND(fc,FFOR(fa,fd)),fd)));
605
+/* 1101 0101  1000 1100   : lev  7: */ tmp1=FFXOR(FFAND(fa,fc),FFOR(FFXOR(fa,fd),FFXOR(FFOR(fb,fc),FFXOR(fd,FF1()))));
606
+/* 0010 0111  1101 1000   : lev  4: */ tmp2=FFXOR(fa,FFXOR(FFAND(FFXOR(fb,fc),fd),fc));
607
+/* 1111 1111  1111 1111   : lev  0: */ tmp3=FF1();
608
+      s3a=FFXOR(tmp0,FFAND(FFNOT(fe),tmp1));
609
+      s3b=FFXOR(tmp2,FFAND(fe,tmp3));
610
+
611
+      fe=regs->A[aboff+2][3];fa=regs->A[aboff+0][1];fb=regs->A[aboff+1][3];fc=regs->A[aboff+3][2];fd=regs->A[aboff+7][0];
612
+/* 1011 0101  0100 1001   : lev  7: */ //tmp0=( fa^( ( fc&( fa^fd ) )|( fb^( fc|( fd^ALL_ONES ) ) ) ) );
613
+/* 0010 1101  0110 0110   : lev  6: */ //tmp1=( ( fa&fb )^( fb^( ( ( fa|fc )&fd )^fc ) ) );
614
+/* 0110 0111  1101 0000   : lev  7: */ //tmp2=( fa^( ( fb&fc )|( ( ( fa&( fb^fd ) )|fc )^fd ) ) );
615
+/* 1111 1111  1111 1111   : lev  0: */ //tmp3=ALL_ONES;
616
+/* 1011 0101  0100 1001   : lev  7: */ tmp0=FFXOR(fa,FFOR(FFAND(fc,FFXOR(fa,fd)),FFXOR(fb,FFOR(fc,FFXOR(fd,FF1())))));
617
+/* 0010 1101  0110 0110   : lev  6: */ tmp1=FFXOR(FFAND(fa,fb),FFXOR(fb,FFXOR(FFAND(FFOR(fa,fc),fd),fc)));
618
+/* 0110 0111  1101 0000   : lev  7: */ tmp2=FFXOR(fa,FFOR(FFAND(fb,fc),FFXOR(FFOR(FFAND(fa,FFXOR(fb,fd)),fc),fd)));
619
+/* 1111 1111  1111 1111   : lev  0: */ tmp3=FF1();
620
+      s4a=FFXOR(tmp0,FFAND(fe,FFXOR(tmp1,tmp0)));
621
+      s4b=FFXOR(FFXOR(s4a,tmp2),FFAND(fe,tmp3));
622
+
623
+      fe=regs->A[aboff+4][2];fa=regs->A[aboff+3][3];fb=regs->A[aboff+5][0];fc=regs->A[aboff+7][1];fd=regs->A[aboff+8][2];
624
+/* 1000 1111  0011 0010   : lev  7: */ //tmp0=( ( ( fa&( fb|fc ) )^fb )|( ( ( fa^fc )|fd )^ALL_ONES ) );
625
+/* 0110 1011  0000 1011   : lev  6: */ //tmp1=( fb^( ( fc^fd )&( fc^( fb|( fa^fd ) ) ) ) );
626
+/* 0001 1010  0111 1001   : lev  6: */ //tmp2=( ( fa&fc )^( fb^( ( fb|( fa^fc ) )&fd ) ) );
627
+/* 0101 1101  1101 0101   : lev  4: */ //tmp3=( ( ( fa^fb )&( fc^ALL_ONES ) )|fd );
628
+/* 1000 1111  0011 0010   : lev  7: */ tmp0=FFOR(FFXOR(FFAND(fa,FFOR(fb,fc)),fb),FFXOR(FFOR(FFXOR(fa,fc),fd),FF1()));
629
+/* 0110 1011  0000 1011   : lev  6: */ tmp1=FFXOR(fb,FFAND(FFXOR(fc,fd),FFXOR(fc,FFOR(fb,FFXOR(fa,fd)))));
630
+/* 0001 1010  0111 1001   : lev  6: */ tmp2=FFXOR(FFAND(fa,fc),FFXOR(fb,FFAND(FFOR(fb,FFXOR(fa,fc)),fd)));
631
+/* 0101 1101  1101 0101   : lev  4: */ tmp3=FFOR(FFAND(FFXOR(fa,fb),FFXOR(fc,FF1())),fd);
632
+      s5a=FFXOR(tmp0,FFAND(fe,tmp1));
633
+      s5b=FFXOR(tmp2,FFAND(fe,tmp3));
634
+
635
+      fe=regs->A[aboff+2][1];fa=regs->A[aboff+3][1];fb=regs->A[aboff+4][0];fc=regs->A[aboff+6][2];fd=regs->A[aboff+8][3];
636
+/* 0011 0110  0010 1101   : lev  6: */ //tmp0=( ( ( fa&fc )&fd )^( ( fb&( fa|fd ) )^fc ) );
637
+/* 1110 1110  1011 1011   : lev  3: */ //tmp1=( ( ( fa^fc )&fd )^ALL_ONES );
638
+/* 0101 1000  0110 0111   : lev  6: */ //tmp2=( ( fa&( fb|fc ) )^( fb^( ( fb&fc )|fd ) ) );
639
+/* 0001 0011  0000 0001   : lev  5: */ //tmp3=( fc&( ( fa&( fb^fd ) )^( fb|fd ) ) );
640
+/* 0011 0110  0010 1101   : lev  6: */ tmp0=FFXOR(FFAND(FFAND(fa,fc),fd),FFXOR(FFAND(fb,FFOR(fa,fd)),fc));
641
+/* 1110 1110  1011 1011   : lev  3: */ tmp1=FFXOR(FFAND(FFXOR(fa,fc),fd),FF1());
642
+/* 0101 1000  0110 0111   : lev  6: */ tmp2=FFXOR(FFAND(fa,FFOR(fb,fc)),FFXOR(fb,FFOR(FFAND(fb,fc),fd)));
643
+/* 0001 0011  0000 0001   : lev  5: */ tmp3=FFAND(fc,FFXOR(FFAND(fa,FFXOR(fb,fd)),FFOR(fb,fd)));
644
+      s6a=FFXOR(tmp0,FFAND(fe,tmp1));
645
+      s6b=FFXOR(tmp2,FFAND(fe,tmp3));
646
+
647
+      fe=regs->A[aboff+1][2];fa=regs->A[aboff+2][0];fb=regs->A[aboff+6][1];fc=regs->A[aboff+7][2];fd=regs->A[aboff+7][3];
648
+/* 0111 1000  1001 0110   : lev  5: */ //tmp0=( fb^( ( fc&fd )|( fa^( fc^fd ) ) ) );
649
+/* 0100 1001  0101 1011   : lev  6: */ //tmp1=( ( fb|fd )&( ( fa&fc )|( fb^( fc^fd ) ) ) );
650
+/* 0100 1001  1011 1001   : lev  5: */ //tmp2=( ( fa|fb )^( ( fc&( fb|fd ) )^fd ) );
651
+/* 1111 1111  1101 1101   : lev  3: */ //tmp3=( fd|( ( fa&fc )^ALL_ONES ) );
652
+/* 0111 1000  1001 0110   : lev  5: */ tmp0=FFXOR(fb,FFOR(FFAND(fc,fd),FFXOR(fa,FFXOR(fc,fd))));
653
+/* 0100 1001  0101 1011   : lev  6: */ tmp1=FFAND(FFOR(fb,fd),FFOR(FFAND(fa,fc),FFXOR(fb,FFXOR(fc,fd))));
654
+/* 0100 1001  1011 1001   : lev  5: */ tmp2=FFXOR(FFOR(fa,fb),FFXOR(FFAND(fc,FFOR(fb,fd)),fd));
655
+/* 1111 1111  1101 1101   : lev  3: */ tmp3=FFOR(fd,FFXOR(FFAND(fa,fc),FF1()));
656
+      s7a=FFXOR(tmp0,FFAND(fe,tmp1));
657
+      s7b=FFXOR(tmp2,FFAND(fe,tmp3));
658
+
659
+
660
+/*
661
+      we have just done this:
662
+      
663
+      int sbox1[0x20] = {2,0,1,1,2,3,3,0, 3,2,2,0,1,1,0,3, 0,3,3,0,2,2,1,1, 2,2,0,3,1,1,3,0};
664
+      int sbox2[0x20] = {3,1,0,2,2,3,3,0, 1,3,2,1,0,0,1,2, 3,1,0,3,3,2,0,2, 0,0,1,2,2,1,3,1};
665
+      int sbox3[0x20] = {2,0,1,2,2,3,3,1, 1,1,0,3,3,0,2,0, 1,3,0,1,3,0,2,2, 2,0,1,2,0,3,3,1};
666
+      int sbox4[0x20] = {3,1,2,3,0,2,1,2, 1,2,0,1,3,0,0,3, 1,0,3,1,2,3,0,3, 0,3,2,0,1,2,2,1};
667
+      int sbox5[0x20] = {2,0,0,1,3,2,3,2, 0,1,3,3,1,0,2,1, 2,3,2,0,0,3,1,1, 1,0,3,2,3,1,0,2};
668
+      int sbox6[0x20] = {0,1,2,3,1,2,2,0, 0,1,3,0,2,3,1,3, 2,3,0,2,3,0,1,1, 2,1,1,2,0,3,3,0};
669
+      int sbox7[0x20] = {0,3,2,2,3,0,0,1, 3,0,1,3,1,2,2,1, 1,0,3,3,0,1,1,2, 2,3,1,0,2,3,0,2};
670
+
671
+      s12 = sbox1[ (((A3>>0)&1)<<4) | (((A0>>2)&1)<<3) | (((A5>>1)&1)<<2) | (((A6>>3)&1)<<1) | (((A8>>0)&1)<<0) ]
672
+           |sbox2[ (((A1>>1)&1)<<4) | (((A2>>2)&1)<<3) | (((A5>>3)&1)<<2) | (((A6>>0)&1)<<1) | (((A8>>1)&1)<<0) ];
673
+      s34 = sbox3[ (((A0>>3)&1)<<4) | (((A1>>0)&1)<<3) | (((A4>>1)&1)<<2) | (((A4>>3)&1)<<1) | (((A5>>2)&1)<<0) ]
674
+           |sbox4[ (((A2>>3)&1)<<4) | (((A0>>1)&1)<<3) | (((A1>>3)&1)<<2) | (((A3>>2)&1)<<1) | (((A7>>0)&1)<<0) ];
675
+      s56 = sbox5[ (((A4>>2)&1)<<4) | (((A3>>3)&1)<<3) | (((A5>>0)&1)<<2) | (((A7>>1)&1)<<1) | (((A8>>2)&1)<<0) ]
676
+           |sbox6[ (((A2>>1)&1)<<4) | (((A3>>1)&1)<<3) | (((A4>>0)&1)<<2) | (((A6>>2)&1)<<1) | (((A8>>3)&1)<<0) ];
677
+      s7 =  sbox7[ (((A1>>2)&1)<<4) | (((A2>>0)&1)<<3) | (((A6>>1)&1)<<2) | (((A7>>2)&1)<<1) | (((A7>>3)&1)<<0) ];
678
+*/
679
+
680
+      // use 4x4 xor to produce extra nibble for T3
681
+
682
+      extra_B[3]=FFXOR(FFXOR(FFXOR(regs->B[aboff+2][0],regs->B[aboff+5][1]),regs->B[aboff+6][2]),regs->B[aboff+8][3]);
683
+      extra_B[2]=FFXOR(FFXOR(FFXOR(regs->B[aboff+5][0],regs->B[aboff+7][1]),regs->B[aboff+2][3]),regs->B[aboff+3][2]);
684
+      extra_B[1]=FFXOR(FFXOR(FFXOR(regs->B[aboff+4][3],regs->B[aboff+7][2]),regs->B[aboff+3][0]),regs->B[aboff+4][1]);
685
+      extra_B[0]=FFXOR(FFXOR(FFXOR(regs->B[aboff+8][2],regs->B[aboff+5][3]),regs->B[aboff+2][1]),regs->B[aboff+7][0]);
686
+for(dbg=0;dbg<4;dbg++){
687
+  DBG(fprintf(stderr,"extra_B[%i]=",dbg));
688
+  DBG(dump_mem("",(unsigned char *)&extra_B[dbg],BYPG,BYPG));
689
+}
690
+
691
+      // T1 = xor all inputs
692
+      // in1, in2, D are only used in T1 during initialisation, not generation
693
+      for(b=0;b<4;b++){
694
+        regs->A[aboff-1][b]=FFXOR(regs->A[aboff+9][b],regs->X[b]);
695
+      }
696
+
697
+#ifdef STREAM_INIT
698
+      for(b=0;b<4;b++){
699
+        regs->A[aboff-1][b]=FFXOR(FFXOR(regs->A[aboff-1][b],regs->D[b]),((j % 2) ? in2[b] : in1[b]));
700
+      }
701
+#endif
702
+
703
+for(dbg=0;dbg<4;dbg++){
704
+  DBG(fprintf(stderr,"next_A0[%i]=",dbg));
705
+  DBG(dump_mem("",(unsigned char *)&regs->A[aboff-1][dbg],BYPG,BYPG));
706
+}
707
+
708
+      // T2 =  xor all inputs
709
+      // in1, in2 are only used in T2 during initialisation, not generation
710
+      // if p=0, use this, if p=1, rotate the result left
711
+      for(b=0;b<4;b++){
712
+        regs->B[aboff-1][b]=FFXOR(FFXOR(regs->B[aboff+6][b],regs->B[aboff+9][b]),regs->Y[b]);
713
+      }
714
+
715
+#ifdef STREAM_INIT
716
+      for(b=0;b<4;b++){
717
+        regs->B[aboff-1][b]=FFXOR(regs->B[aboff-1][b],((j % 2) ? in1[b] : in2[b]));
718
+      }
719
+#endif
720
+
721
+for(dbg=0;dbg<4;dbg++){
722
+  DBG(fprintf(stderr,"next_B0[%i]=",dbg));
723
+  DBG(dump_mem("",(unsigned char *)&regs->B[aboff-1][dbg],BYPG,BYPG));
724
+}
725
+
726
+      // if p=1, rotate left (yes, this is what we're doing)
727
+      tmp3=regs->B[aboff-1][3];
728
+      regs->B[aboff-1][3]=FFXOR(regs->B[aboff-1][3],FFAND(FFXOR(regs->B[aboff-1][3],regs->B[aboff-1][2]),regs->p));
729
+      regs->B[aboff-1][2]=FFXOR(regs->B[aboff-1][2],FFAND(FFXOR(regs->B[aboff-1][2],regs->B[aboff-1][1]),regs->p));
730
+      regs->B[aboff-1][1]=FFXOR(regs->B[aboff-1][1],FFAND(FFXOR(regs->B[aboff-1][1],regs->B[aboff-1][0]),regs->p));
731
+      regs->B[aboff-1][0]=FFXOR(regs->B[aboff-1][0],FFAND(FFXOR(regs->B[aboff-1][0],tmp3),regs->p));
732
+
733
+for(dbg=0;dbg<4;dbg++){
734
+  DBG(fprintf(stderr,"next_B0[%i]=",dbg));
735
+  DBG(dump_mem("",(unsigned char *)&regs->B[aboff-1][dbg],BYPG,BYPG));
736
+}
737
+
738
+      // T3 = xor all inputs
739
+      for(b=0;b<4;b++){
740
+        regs->D[b]=FFXOR(FFXOR(regs->E[b],regs->Z[b]),extra_B[b]);
741
+      }
742
+
743
+for(dbg=0;dbg<4;dbg++){
744
+  DBG(fprintf(stderr,"D[%i]=",dbg));
745
+  DBG(dump_mem("",(unsigned char *)&regs->D[dbg],BYPG,BYPG));
746
+}
747
+
748
+      // T4 = sum, carry of Z + E + r
749
+      for(b=0;b<4;b++){
750
+        next_E[b]=regs->F[b];
751
+      }
752
+
753
+      tmp0=FFXOR(regs->Z[0],regs->E[0]);
754
+      tmp1=FFAND(regs->Z[0],regs->E[0]);
755
+      regs->F[0]=FFXOR(regs->E[0],FFAND(regs->q,FFXOR(regs->Z[0],regs->r)));
756
+      tmp3=FFAND(tmp0,regs->r);
757
+      tmp4=FFOR(tmp1,tmp3);
758
+
759
+      tmp0=FFXOR(regs->Z[1],regs->E[1]);
760
+      tmp1=FFAND(regs->Z[1],regs->E[1]);
761
+      regs->F[1]=FFXOR(regs->E[1],FFAND(regs->q,FFXOR(regs->Z[1],tmp4)));
762
+      tmp3=FFAND(tmp0,tmp4);
763
+      tmp4=FFOR(tmp1,tmp3);
764
+
765
+      tmp0=FFXOR(regs->Z[2],regs->E[2]);
766
+      tmp1=FFAND(regs->Z[2],regs->E[2]);
767
+      regs->F[2]=FFXOR(regs->E[2],FFAND(regs->q,FFXOR(regs->Z[2],tmp4)));
768
+      tmp3=FFAND(tmp0,tmp4);
769
+      tmp4=FFOR(tmp1,tmp3);
770
+
771
+      tmp0=FFXOR(regs->Z[3],regs->E[3]);
772
+      tmp1=FFAND(regs->Z[3],regs->E[3]);
773
+      regs->F[3]=FFXOR(regs->E[3],FFAND(regs->q,FFXOR(regs->Z[3],tmp4)));
774
+      tmp3=FFAND(tmp0,tmp4);
775
+      regs->r=FFXOR(regs->r,FFAND(regs->q,FFXOR(FFOR(tmp1,tmp3),regs->r))); // ultimate carry
776
+
777
+/*
778
+      we have just done this: (believe it or not)
779
+      
780
+      if (q) {
781
+        F = Z + E + r;
782
+        r = (F >> 4) & 1;
783
+        F = F & 0x0f;
784
+      }
785
+      else {
786
+          F = E;
787
+      }
788
+*/
789
+      for(b=0;b<4;b++){
790
+        regs->E[b]=next_E[b];
791
+      }
792
+for(dbg=0;dbg<4;dbg++){
793
+  DBG(fprintf(stderr,"F[%i]=",dbg));
794
+  DBG(dump_mem("",(unsigned char *)&regs->F[dbg],BYPG,BYPG));
795
+}
796
+DBG(fprintf(stderr,"r="));
797
+DBG(dump_mem("",(unsigned char *)&regs->r,BYPG,BYPG));
798
+for(dbg=0;dbg<4;dbg++){
799
+  DBG(fprintf(stderr,"E[%i]=",dbg));
800
+  DBG(dump_mem("",(unsigned char *)&regs->E[dbg],BYPG,BYPG));
801
+}
802
+
803
+      // this simple instruction is virtually shifting all the shift registers
804
+      aboff--;
805
+
806
+/*
807
+      we've just done this:
808
+
809
+      A9=A8;A8=A7;A7=A6;A6=A5;A5=A4;A4=A3;A3=A2;A2=A1;A1=A0;A0=next_A0;
810
+      B9=B8;B8=B7;B7=B6;B6=B5;B5=B4;B4=B3;B3=B2;B2=B1;B1=B0;B0=next_B0;
811
+*/
812
+
813
+      regs->X[0]=s1a;
814
+      regs->X[1]=s2a;
815
+      regs->X[2]=s3b;
816
+      regs->X[3]=s4b;
817
+      regs->Y[0]=s3a;
818
+      regs->Y[1]=s4a;
819
+      regs->Y[2]=s5b;
820
+      regs->Y[3]=s6b;
821
+      regs->Z[0]=s5a;
822
+      regs->Z[1]=s6a;
823
+      regs->Z[2]=s1b;
824
+      regs->Z[3]=s2b;
825
+      regs->p=s7a;
826
+      regs->q=s7b;
827
+for(dbg=0;dbg<4;dbg++){
828
+  DBG(fprintf(stderr,"X[%i]=",dbg));
829
+  DBG(dump_mem("",(unsigned char *)&regs->X[dbg],BYPG,BYPG));
830
+}
831
+for(dbg=0;dbg<4;dbg++){
832
+  DBG(fprintf(stderr,"Y[%i]=",dbg));
833
+  DBG(dump_mem("",(unsigned char *)&regs->Y[dbg],BYPG,BYPG));
834
+}
835
+for(dbg=0;dbg<4;dbg++){
836
+  DBG(fprintf(stderr,"Z[%i]=",dbg));
837
+  DBG(dump_mem("",(unsigned char *)&regs->Z[dbg],BYPG,BYPG));
838
+}
839
+DBG(fprintf(stderr,"p="));
840
+DBG(dump_mem("",(unsigned char *)&regs->p,BYPG,BYPG));
841
+DBG(fprintf(stderr,"q="));
842
+DBG(dump_mem("",(unsigned char *)&regs->q,BYPG,BYPG));
843
+
844
+#ifdef STREAM_NORMAL
845
+      // 4 inner-loop iterations are required per output byte
846
+      // 2 output bits are a function of the 4 bits of D
847
+      // xor 2 by 2
848
+      cb_g[8*i+7-2*j]=FFXOR(regs->D[2],regs->D[3]);
849
+      cb_g[8*i+6-2*j]=FFXOR(regs->D[0],regs->D[1]);
850
+for(dbg=0;dbg<8;dbg++){
851
+  DBG(fprintf(stderr,"op[%i]=",dbg));
852
+  DBG(dump_mem("",(unsigned char *)&cb_g[8*i+dbg],BYPG,BYPG));
853
+}
854
+#endif
855
+
856
+DBG(fprintf(stderr,"---END INTERNAL LOOP\n"));
857
+
858
+    } // INTERNAL LOOP
859
+
860
+DBG(fprintf(stderr,"--END EXTERNAL LOOP\n"));
861
+
862
+  } // EXTERNAL LOOP
863
+
864
+  // aboff has stepped down 32 times; copy the 10 live A/B entries back up by 32, ready for next call
865
+  for(k=0;k<10;k++){
866
+    for(b=0;b<4;b++){
867
+DBG(fprintf(stderr,"moving forward AB k=%i b=%i\n",k,b));
868
+      regs->A[32+k][b]=regs->A[k][b];
869
+      regs->B[32+k][b]=regs->B[k][b];
870
+    }
871
+  }
872
+
873
+
874
+////////////////////////////////////////////////////////////////////////////////
875
+
876
+#ifdef STREAM_NORMAL
877
+for(j=0;j<64;j++){
878
+  DBG(fprintf(stderr,"postcall prerot cb[%2i]=",j));
879
+  DBG(dump_mem("",(unsigned char *)(cb+BYPG*j),BYPG,BYPG));
880
+}
881
+
882
+#if GROUP_PARALLELISM==32
883
+trasp64_32_88cw(cb);
884
+#endif
885
+#if GROUP_PARALLELISM==64
886
+trasp64_64_88cw(cb);
887
+#endif
888
+#if GROUP_PARALLELISM==128
889
+trasp64_128_88cw(cb);
890
+#endif
891
+
892
+for(j=0;j<64;j++){
893
+  DBG(fprintf(stderr,"postcall postrot cb[%2i]=",j));
894
+  DBG(dump_mem("",(unsigned char *)(cb+BYPG*j),BYPG,BYPG));
895
+}
896
+#endif
897
+
898
+#ifdef STREAM_INIT
899
+  DBG(fprintf(stderr,":::::::::: END STREAM INIT\n"));
900
+#endif
901
+#ifdef STREAM_NORMAL
902
+  DBG(fprintf(stderr,":::::::::: END STREAM NORMAL\n"));
903
+#endif
904
+
905
+}
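Note on the s-box section of the routine above: the `tmp0`..`tmp3` expressions evaluate one s-box for every lane at once, with `fe` selecting between the two halves of the table. A minimal sketch that re-expresses s-box 1 with plain `uint64_t` operators (64 lanes, standing in for the `FFXOR`/`FFAND`/`FFOR` macros) and checks it against the `sbox1` table quoted in the source comment. `sbox1_sliced` and `sbox1_mismatches` are hypothetical names, and reading `s1a` as the high output bit is an assumption, though it is the reading that makes the table match.

```c
#include <stdint.h>

/* Table from the source comment; index bits are fe,fa,fb,fc,fd (fe is the MSB). */
static const int sbox1[32] = {2,0,1,1,2,3,3,0, 3,2,2,0,1,1,0,3,
                              0,3,3,0,2,2,1,1, 2,2,0,3,1,1,3,0};

/* Bit-sliced s-box 1: bit n of each argument belongs to lane n. */
static void sbox1_sliced(uint64_t fe, uint64_t fa, uint64_t fb,
                         uint64_t fc, uint64_t fd,
                         uint64_t *hi, uint64_t *lo)
{
    uint64_t tmp0 = fa ^ fb ^ ~(((fa | fb) ^ fc) | (fc ^ fd));
    uint64_t tmp1 = (fa | fb) ^ ~(fc & (fa | (fb ^ fd)));
    uint64_t tmp2 = fa ^ (fb & fd) ^ ((fa & fd) | fc);
    uint64_t tmp3 = (fa & fc) ^ fa ^ ((fa & fb) | fd);
    *hi = tmp0 ^ (fe & tmp1);       /* s1a in the source */
    *lo = tmp2 ^ (fe & tmp3);       /* s1b in the source */
}

/* Put input n in lane n for n = 0..31 and count disagreements with the table. */
static int sbox1_mismatches(void)
{
    uint64_t fe = 0, fa = 0, fb = 0, fc = 0, fd = 0, hi, lo;
    int n, bad = 0;
    for (n = 0; n < 32; n++) {
        fe |= (uint64_t)((n >> 4) & 1) << n;
        fa |= (uint64_t)((n >> 3) & 1) << n;
        fb |= (uint64_t)((n >> 2) & 1) << n;
        fc |= (uint64_t)((n >> 1) & 1) << n;
        fd |= (uint64_t)( n       & 1) << n;
    }
    sbox1_sliced(fe, fa, fb, fc, fd, &hi, &lo);
    for (n = 0; n < 32; n++) {
        int got = (int)((((hi >> n) & 1) << 1) | ((lo >> n) & 1));
        if (got != sbox1[n]) bad++;
    }
    return bad;
}
```

The 16-entry truth tables in the source comments (for example `1000 1110 1110 0001` for `tmp0`) are the high or low output bit of the `fe=0` half of the table, listed for `fa,fb,fc,fd` from 0000 to 1111.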
906
+
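Note on the T4 step of the routine above: it is a 4-bit ripple-carry adder evaluated on bit planes, gated by `q` so that lanes with `q=0` keep `F = E` and leave `r` unchanged. A minimal sketch, again assuming 64 lanes per `uint64_t` in place of `group`; `cond_add`, `cond_add_ref` and `cond_add_mismatches` are hypothetical names. The scalar reference mirrors the pseudo-code in the source comment.

```c
#include <stdint.h>

/* Bit-sliced: Z[i], E[i], F[i] hold bit i (0 = LSB) of 64 nibbles. */
static void cond_add(const uint64_t Z[4], const uint64_t E[4],
                     uint64_t q, uint64_t *r, uint64_t F[4])
{
    uint64_t carry = *r;
    int i;
    for (i = 0; i < 4; i++) {
        uint64_t x = Z[i] ^ E[i];
        uint64_t g = Z[i] & E[i];
        F[i] = E[i] ^ (q & (Z[i] ^ carry));  /* sum bit if q, else E */
        carry = g | (x & carry);             /* full-adder carry out */
    }
    *r ^= q & (carry ^ *r);                  /* new carry only where q = 1 */
}

/* Scalar reference, following the comment in the source. */
static void cond_add_ref(unsigned z, unsigned e, unsigned q,
                         unsigned *r, unsigned *f)
{
    if (q) { unsigned s = z + e + *r; *r = (s >> 4) & 1; *f = s & 0x0f; }
    else   { *f = e; }
}

/* Exhaustive check: 16 z * 16 e * 2 q * 2 r = 1024 cases, 64 per batch. */
static int cond_add_mismatches(void)
{
    unsigned batch, lane;
    int i, bad = 0;
    for (batch = 0; batch < 16; batch++) {
        uint64_t Z[4] = {0}, E[4] = {0}, F[4], q = 0, r = 0;
        for (lane = 0; lane < 64; lane++) {
            unsigned c = batch * 64 + lane;  /* bits: z 0-3, e 4-7, q 8, r 9 */
            for (i = 0; i < 4; i++) {
                Z[i] |= (uint64_t)((c >> i) & 1) << lane;
                E[i] |= (uint64_t)((c >> (4 + i)) & 1) << lane;
            }
            q |= (uint64_t)((c >> 8) & 1) << lane;
            r |= (uint64_t)((c >> 9) & 1) << lane;
        }
        cond_add(Z, E, q, &r, F);
        for (lane = 0; lane < 64; lane++) {
            unsigned c = batch * 64 + lane, rr = (c >> 9) & 1, f, got = 0;
            cond_add_ref(c & 15, (c >> 4) & 15, (c >> 8) & 1, &rr, &f);
            for (i = 0; i < 4; i++) got |= (unsigned)((F[i] >> lane) & 1) << i;
            if (got != f || (unsigned)((r >> lane) & 1) != rr) bad++;
        }
    }
    return bad;
}
```

The source unrolls the four bit positions by hand and reuses `tmp0`..`tmp4`; the loop form here is only for readability.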
